Data Science Asked on February 16, 2021
This is more like a general NLP question.
What is the appropriate input for training a word embedding model such as Word2Vec? Should every sentence of an article be a separate document in the corpus, or should each whole article be one document in that corpus?
Here is an example using Python and gensim.
Corpus split by sentence:
SentenceCorpus = [["first", "sentence", "of", "the", "first", "article."],
["second", "sentence", "of", "the", "first", "article."],
["first", "sentence", "of", "the", "second", "article."],
["second", "sentence", "of", "the", "second", "article."]]
Corpus split by article:
ArticleCorpus = [["first", "sentence", "of", "the", "first", "article.",
"second", "sentence", "of", "the", "first", "article."],
["first", "sentence", "of", "the", "second", "article.",
"second", "sentence", "of", "the", "second", "article."]]
Training Word2Vec in Python:
from gensim.models import Word2Vec

# min_count lowered from the default of 5 so the tiny toy corpus above is not filtered out
wikiWord2Vec = Word2Vec(ArticleCorpus, min_count=1)
The answer to this question is that it depends. The most common approach is to pass in the tokenized sentences (SentenceCorpus in your example), but depending on your goal and on the corpus you're working with, you may instead want to use entire articles to learn the embeddings. This is often something you can't know ahead of time, so you'll have to decide how to evaluate the quality of the embeddings and run some experiments to see which 'kind' of embeddings is more useful for your task(s).
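For instance, a minimal sketch of such an experiment might look like the following (assuming gensim 4.x and reusing the hypothetical SentenceCorpus and ArticleCorpus from the question; min_count and epochs are adjusted only because the toy corpus is tiny):

from gensim.models import Word2Vec

# Train one model per segmentation and compare nearest neighbours for a few probe words.
sentence_model = Word2Vec(SentenceCorpus, vector_size=50, min_count=1, epochs=50)
article_model = Word2Vec(ArticleCorpus, vector_size=50, min_count=1, epochs=50)

for probe in ["first", "sentence"]:
    print(probe, sentence_model.wv.most_similar(probe, topn=3))
    print(probe, article_model.wv.most_similar(probe, topn=3))

One practical difference to keep in mind: gensim treats each list item as one training unit and the context window does not cross item boundaries, so article-level input lets contexts span sentence breaks, while sentence-level input keeps contexts within single sentences.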
Correct answer by NBartley on February 16, 2021
For the former (a corpus split into sentences), gensim has the Word2Vec class; for the latter (treating each whole article as a document), there is Doc2Vec.
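As a rough sketch of the latter (again assuming gensim 4.x and reusing the hypothetical ArticleCorpus from the question), each article is wrapped in a TaggedDocument so the model learns one vector per article in addition to the word vectors:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per article; the integer tag identifies the article.
tagged = [TaggedDocument(words=article, tags=[i]) for i, article in enumerate(ArticleCorpus)]
doc_model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

print(doc_model.dv[0])  # learned vector for the first article
print(doc_model.infer_vector(["first", "sentence", "of", "the", "first", "article."]))  # vector for unseen text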
Answered by user13684 on February 16, 2021