TransWikia.com

What is a better input for Word2Vec?

Data Science Asked on February 16, 2021

This is more like a general NLP question.
What is the appropriate input to train a word embedding namely Word2Vec? Should all sentences belonging to an article be a separate document in a corpus? Or should each article be a document in said corpus?
This is just an example using python and gensim.

Corpus split by sentence:

SentenceCorpus = [["first", "sentence", "of", "the", "first", "article."],
                  ["second", "sentence", "of", "the", "first", "article."],
                  ["first", "sentence", "of", "the", "second", "article."],
                  ["second", "sentence", "of", "the", "second", "article."]]

Corpus split by article:

ArticleCorpus = [["first", "sentence", "of", "the", "first", "article.",
                  "second", "sentence", "of", "the", "first", "article."],
                 ["first", "sentence", "of", "the", "second", "article.",
                  "second", "sentence", "of", "the", "second", "article."]]

Training Word2Vec in Python:

from gensim.models import Word2Vec

wikiWord2Vec = Word2Vec(ArticleCorpus)

3 Answers

The answer to this question is that it depends. The primary approach is to pass in the tokenized sentences (so SentenceCorpus in your example), but depending on what your goal is and what the corpus is you're looking at, you might want to instead use the entire article to learn the embeddings. This is something you might not know ahead of time -- so you'll have to think about how you want to evaluate the quality of the embeddings, and do some experiments to see which 'kind' of embeddings are more useful for your task(s).

Correct answer by NBartley on February 16, 2021

For the former, gensim has the Word2Vec class. For the latter, Doc2Vec.

Answered by user13684 on February 16, 2021

As a supplementary to @NBartley's answer. To anyone come across this question. I have tried use article/sentence as input for word2vec on Spark2.2, result as follow.

use sentence as input:

enter image description here

use article as input:

enter image description here

Answered by Zachary on February 16, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP