What is the structure and dimension of the input passed to a neural network when training CBOW and skip-gram word embeddings?

Data Science Asked on August 19, 2021

I am confused about the input passed to a neural network in natural language processing (NLP) when training CBOW word embeddings from scratch. I read the paper and have some doubts.

In a general neural network (NN) architecture, it is fairly clear that each row acts as one input to the network, with d features. For example, in the figure below:

[Figure: a feed-forward neural network whose input layer takes X1, X2, X3]

X1, X2, X3 together form one input, i.e. one row of the data frame. So here one data point has dimension 3, and the data frame would look like this:

X1  X2  X3
1   2   3
4   5   6
7   8   9

Is my understanding correct?
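
As a concrete illustration of that (the hidden-layer size and the weights below are arbitrary choices, not from the source), a minimal NumPy sketch of feeding those three rows through one dense layer:

    import numpy as np

    # Each row of the data frame is one training example with d = 3 features.
    X = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0]])      # shape (3 examples, 3 features)

    # One dense layer with 4 hidden units (an arbitrary choice for illustration).
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))          # (input features, hidden units)
    b = np.zeros(4)

    hidden = X @ W + b                   # shape (3, 4): one activation row per example
    print(hidden.shape)                  # (3, 4)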

Now coming to the NLP CBOW architecture, let's take an example to train CBOW word embeddings:

Sentence1: "I like natural processing domain."

Creating training data from the above sentence with window size = 1:

Input                   Output

(I, natural)            like
(like, processing)      natural
(natural, domain)       processing
(processing)            domain

Is the above training data for the CBOW architecture with window size = 1 correct?
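
For reference, a small Python sketch of the usual sliding-window construction: every position becomes a target, and the context is whatever neighbours fall inside the window, so the boundary words keep a shorter context. Note that it also yields the pair (like) -> I, which the table above omits:

    # CBOW training-pair generation with window size 1.
    sentence = "I like natural processing domain".split()
    window = 1

    pairs = []
    for i, target in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, i - window), min(len(sentence), i + window + 1))
                   if j != i]
        pairs.append((tuple(context), target))

    for context, target in pairs:
        print(context, "->", target)
    # (('like',), 'I')
    # (('I', 'natural'), 'like')
    # (('like', 'processing'), 'natural')
    # (('natural', 'domain'), 'processing')
    # (('processing',), 'domain')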

My questions are below:

How do I pass this training data to the neural network in the figure above?

If I represent every word as a one-hot encoded vector whose dimension equals the vocabulary size V, how should I pass 2 words at the same time, giving an input of dimension 2V?

Is this the way to pass the input for the first training sample? I just concatenated the two input words:

[Figure: the one-hot vectors for "I" and "natural" concatenated into a single 2V-dimensional input vector]

Then do I train the network to learn the word embeddings using cross-entropy loss?

Is this the right way to pass the input?
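
To make the concatenation concrete, a minimal sketch of what is meant (the toy vocabulary, indices, and sizes are purely illustrative):

    import numpy as np

    # Toy vocabulary built from the example sentence, so V = 5.
    vocab = ["I", "like", "natural", "processing", "domain"]
    V = len(vocab)
    idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(V)
        v[idx[word]] = 1.0
        return v

    # First training sample: context (I, natural), target "like".
    context = ["I", "natural"]
    x = np.concatenate([one_hot(w) for w in context])  # input of dimension 2V = 10
    y = one_hot("like")                                # target for the cross-entropy loss
    print(x.shape, y.shape)                            # (10,) (5,)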

Secondly, will the middle layer give us the word embeddings for the 2 input (context) words, or for the target word?

One Answer

Just think of it as a simple binary logistic classifier.

The training data consists of word pairs $(w, c)$ (positive samples) extracted from a large corpus, plus, for each positive pair, $k$ negative samples in which a new $c$ is drawn from a noise distribution.

The model has two layers of parameters with no non-linear function between them and a sigmoid (not a softmax) on the output. The input and output layers have one dimension per word, and the middle layer has the embedding dimension (e.g. 500). For a word pair $(w, c)$, feed in a one-hot vector representing $w$, and at the output unit representing $c$ predict 1 if the pair is a positive sample and 0 if it is negative.
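
A minimal NumPy sketch of that setup (the vocabulary size, embedding dimension, learning rate, and the uniform noise distribution below are all placeholder choices): feeding a one-hot vector for $w$ just selects row $w$ of the first matrix, and reading the output unit for $c$ is a dot product with row $c$ of the second matrix, passed through the sigmoid.

    import numpy as np

    V, d = 10_000, 500   # vocabulary size and middle-layer dimension (example values)
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.01, size=(V, d))    # first parameter layer (input-side vectors)
    W_out = rng.normal(scale=0.01, size=(V, d))   # second parameter layer (output-side vectors)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def prob_positive(w, c):
        # A one-hot input for w selects row w of W_in; the output unit for c
        # is the dot product with row c of W_out, passed through the sigmoid.
        return sigmoid(W_in[w] @ W_out[c])

    def train_pair(w, c, negatives, lr=0.025):
        # One gradient step on a positive pair (w, c) and its k negative samples,
        # using the binary cross-entropy loss.
        for ctx, label in [(c, 1.0)] + [(n, 0.0) for n in negatives]:
            g = prob_positive(w, ctx) - label      # d(loss)/d(score)
            grad_in = g * W_out[ctx]
            W_out[ctx] -= lr * g * W_in[w]
            W_in[w] -= lr * grad_in

    # k = 5 negative samples; uniform sampling is only a placeholder here
    # (word2vec draws negatives from a smoothed unigram distribution).
    k = 5
    train_pair(w=42, c=7, negatives=rng.integers(0, V, size=k))

After training, the rows of the first matrix (and sometimes the second, or their combination) are used as the word embeddings.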

Answered by Carl on August 19, 2021
