Data Science: Asked on August 19, 2021
I am confused about the input passed to a neural network in natural language processing (NLP) when training CBOW word embeddings from scratch. I read the paper and have some doubts.
In a general neural network (NN) architecture, it is clearer how each row acts as an input to the network with d features. For example, in the figure below, X1, X2, X3 is one input, i.e. one row of the data frame. So here, one data point has dimension 3, and the data frame would look like this:
X1 X2 X3
1 2 3
4 5 6
7 8 9
Is my understanding correct?
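For concreteness, here is a minimal sketch of that idea in numpy (the hidden-layer size and random weights are made up for illustration, not taken from the figure):

    import numpy as np

    # Each row is one data point with d = 3 features (X1, X2, X3).
    X = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=float)

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(3, 4))  # input (d = 3) -> hidden (4 units, arbitrary)
    W2 = rng.normal(size=(4, 1))  # hidden -> single output

    hidden = np.tanh(X @ W1)      # every row is processed independently
    output = hidden @ W2
    print(output.shape)           # (3, 1): one prediction per input row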
Now coming to NLP and the CBOW architecture, let's take an example to train CBOW word embeddings.
Sentence 1: "I like natural processing domain."
Creating training data from the above sentence with window size = 1:
Input                 Output
(I, natural)          like
(like, processing)    natural
(natural, domain)     processing
(processing)          domain
Is the above creation of training data for the CBOW architecture with window size = 1 correct?
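For reference, here is a small sketch that generates such (context, target) pairs with a naive whitespace tokenizer (note that it also yields the pair ('like',) -> I for the first word, which the table above skips):

    # Build (context, target) pairs for CBOW with window size 1.
    sentence = "I like natural processing domain"
    tokens = sentence.split()

    window = 1
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), target))

    for context, target in pairs:
        print(context, "->", target)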
My questions are below:
How will I pass this training data to the neural network in the above figure? If I represent every word as a one-hot encoded vector of dimension equal to the vocabulary size V as input to the neural network, then how should I pass 2 words at the same time, giving an input of dimension 2V?
Is this the way to pass the input for the first training sample: I just concatenate the one-hot vectors of the two input words, then train the network to learn word embeddings using cross-entropy loss? Is this the right way to pass the input? (A small sketch after these questions shows what I mean.)
Secondly, will the middle layer give us the word embeddings for the 2 input words or for the target word?
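Here is the concatenation sketch referred to above (the vocabulary is just the five words of the example sentence):

    import numpy as np

    vocab = ["I", "like", "natural", "processing", "domain"]
    V = len(vocab)
    word_to_idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(V)
        v[word_to_idx[word]] = 1.0
        return v

    # Concatenation, as proposed above: a 2V-dimensional input.
    x_concat = np.concatenate([one_hot("I"), one_hot("natural")])  # shape (2V,)

    # For comparison: the original CBOW model averages the context vectors
    # instead, keeping the input V-dimensional regardless of window size.
    x_avg = (one_hot("I") + one_hot("natural")) / 2.0              # shape (V,)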
Just think of it as a simple binary logistic classifier.
The data consists of word pairs $(w, c)$ (positive samples) extracted from a large corpus and, for each of those, $k$ negative samples, where a new $c$ is drawn from a noise distribution.
The model has two layers of parameters with no non-linear function between them and a sigmoid function on the output (not a softmax). The input and output layers have one dimension per word, and the middle layer has the embedding dimension (e.g. 500). For a word pair $(w, c)$, feed a one-hot vector representing $w$ at the input and, at the output dimension representing $c$, predict 1 if it is a positive sample and 0 if it is a negative one.
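A minimal PyTorch sketch of that classifier (the vocabulary size, embedding dimension, learning rate, and toy indices are illustrative assumptions):

    import torch
    import torch.nn as nn

    V, D = 10000, 500            # vocabulary size, middle-layer (embedding) size
    w_emb = nn.Embedding(V, D)   # first parameter layer: multiplying a one-hot
                                 # vector by a weight matrix == an embedding lookup
    c_emb = nn.Embedding(V, D)   # second parameter layer: output-side vectors

    def score(w_idx, c_idx):
        # Dot product of the two vectors, squashed by a sigmoid: the predicted
        # probability that (w, c) is a positive pair from the corpus.
        return torch.sigmoid((w_emb(w_idx) * c_emb(c_idx)).sum(dim=-1))

    loss_fn = nn.BCELoss()
    opt = torch.optim.SGD(
        list(w_emb.parameters()) + list(c_emb.parameters()), lr=0.05)

    # Toy batch: one positive pair plus k = 2 negative samples (indices made up).
    w = torch.tensor([3, 3, 3])
    c = torch.tensor([7, 42, 99])            # 7 seen with 3; 42 and 99 from noise
    labels = torch.tensor([1.0, 0.0, 0.0])

    opt.zero_grad()
    loss = loss_fn(score(w, c), labels)
    loss.backward()
    opt.step()

    # After training, the rows of w_emb.weight are the learned word embeddings.

The embedding lookup here is exactly the "one-hot vector times weight matrix" multiplication described above, just computed efficiently.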
Answered by Carl on August 19, 2021