Data Science: Asked on August 19, 2021
I am confused about the input passed to a neural network in natural language processing (NLP) when training CBOW word embeddings from scratch. I read the paper and have some doubts.
In a general neural network (NN) architecture, it is clearer how each row acts as an input to the network with d features. For example, in the figure below, X1, X2, X3 is one input, i.e. one row of the data frame. So here, one data point has dimension 3, and the data frame would look like this:
X1 X2 X3
1 2 3
4 5 6
7 8 9
Is my understanding correct?
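For concreteness, here is a minimal sketch of that idea in numpy (the hidden-layer size and random weights are made up for illustration, not taken from the figure):

    import numpy as np

    # Each row is one data point with d = 3 features (X1, X2, X3).
    X = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=float)

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 4))  # input (d = 3) -> hidden (4 units, arbitrary)
    W2 = rng.normal(size=(4, 1))  # hidden -> single output

    hidden = np.tanh(X @ W1)      # every row is processed independently
    output = hidden @ W2
    print(output.shape)           # (3, 1): one prediction per input row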
Now coming to NLP and the CBOW architecture, let's take an example to train CBOW word embeddings.
Sentence 1: "I like natural processing domain."
Creating training data from the above sentence with window size = 1:
Input                 Output
(I, natural)          like
(like, processing)    natural
(natural, domain)     processing
(processing)          domain
Is the above creation of training data for the CBOW architecture with window size = 1 correct?
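For reference, here is a small sketch that generates such (context, target) pairs with a naive whitespace tokenizer (note that it also yields the pair ('like',) -> I for the first word, which the table above skips):

    # Build (context, target) pairs for CBOW with window size 1.
    sentence = "I like natural processing domain"
    tokens = sentence.split()

    window = 1
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), target))

    for context, target in pairs:
        print(context, "->", target)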
My questions are below:
How will I pass this training data to the neural network in the above figure? If I represent every word as a one-hot encoded vector of dimension equal to the vocabulary size V as input to the neural network, then how should I pass 2 words at the same time, giving an input of dimension 2V?
Is this the way to pass the input for the first training sample: I just concatenate the one-hot vectors of the two input words, then train the network to learn word embeddings using cross-entropy loss? Is this the right way to pass the input? (A small sketch after these questions shows what I mean.)
Secondly, will the middle layer give us the word embeddings for the 2 input words or for the target word?
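Here is the concatenation sketch referred to above (the vocabulary is just the five words of the example sentence):

    import numpy as np

    vocab = ["I", "like", "natural", "processing", "domain"]
    V = len(vocab)
    word_to_idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(V)
        v[word_to_idx[word]] = 1.0
        return v

    # Concatenation, as proposed above: a 2V-dimensional input.
    x_concat = np.concatenate([one_hot("I"), one_hot("natural")])  # shape (2V,)

    # For comparison: the original CBOW model averages the context vectors
    # instead, keeping the input V-dimensional regardless of window size.
    x_avg = (one_hot("I") + one_hot("natural")) / 2.0              # shape (V,)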
Just think of it as a simple binary logistic classifier.
The data consists of word pairs $(w, c)$ (positive samples) extracted from a large corpus and, for each of those, $k$ negative samples, where a new $c$ is drawn from a noise distribution.
The model has two layers of parameters with no non-linear function between them and a sigmoid function on the output (not a softmax). The input and output layers have one dimension per word, and the middle layer has the embedding dimension (e.g. 500). For a word pair $(w, c)$, feed a one-hot vector representing $w$ at the input and, at the output dimension representing $c$, predict 1 if it is a positive sample and 0 if it is a negative one.
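A minimal PyTorch sketch of that classifier (the vocabulary size, embedding dimension, learning rate, and toy indices are illustrative assumptions):

    import torch
    import torch.nn as nn

    V, D = 10000, 500            # vocabulary size, middle-layer (embedding) size
    w_emb = nn.Embedding(V, D)   # first parameter layer: multiplying a one-hot
                                 # vector by a weight matrix == an embedding lookup
    c_emb = nn.Embedding(V, D)   # second parameter layer: output-side vectors

    def score(w_idx, c_idx):
        # Dot product of the two vectors, squashed by a sigmoid: the predicted
        # probability that (w, c) is a positive pair from the corpus.
        return torch.sigmoid((w_emb(w_idx) * c_emb(c_idx)).sum(dim=-1))

    loss_fn = nn.BCELoss()
    opt = torch.optim.SGD(
        list(w_emb.parameters()) + list(c_emb.parameters()), lr=0.05)

    # Toy batch: one positive pair plus k = 2 negative samples (indices made up).
    w = torch.tensor([3, 3, 3])
    c = torch.tensor([7, 42, 99])            # 7 seen with 3; 42 and 99 from noise
    labels = torch.tensor([1.0, 0.0, 0.0])

    opt.zero_grad()
    loss = loss_fn(score(w, c), labels)
    loss.backward()
    opt.step()

    # After training, the rows of w_emb.weight are the learned word embeddings.

The embedding lookup here is exactly the "one-hot vector times weight matrix" multiplication described above, just computed efficiently.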
Answered by Carl on August 19, 2021