Data Science Asked on February 16, 2021
This is more like a general NLP question.
What is the appropriate input for training a word embedding model such as Word2Vec? Should every sentence of an article be a separate document in the corpus, or should each whole article be one document in that corpus?
Here is an example using Python and gensim.
Corpus split by sentence:
SentenceCorpus = [["first", "sentence", "of", "the", "first", "article."],
["second", "sentence", "of", "the", "first", "article."],
["first", "sentence", "of", "the", "second", "article."],
["second", "sentence", "of", "the", "second", "article."]]
Corpus split by article:
ArticleCorpus = [["first", "sentence", "of", "the", "first", "article.",
"second", "sentence", "of", "the", "first", "article."],
["first", "sentence", "of", "the", "second", "article.",
"second", "sentence", "of", "the", "second", "article."]]
Training Word2Vec in Python:
from gensim.models import Word2Vec

# min_count lowered from the default of 5 so the tiny toy corpus above is not filtered out
wikiWord2Vec = Word2Vec(ArticleCorpus, min_count=1)
The answer to this question is that it depends. The most common approach is to pass in the tokenized sentences (SentenceCorpus in your example), but depending on your goal and on the corpus you're working with, you may instead want to use entire articles to learn the embeddings. This is often something you can't know ahead of time, so you'll have to decide how to evaluate the quality of the embeddings and run some experiments to see which 'kind' of embeddings is more useful for your task(s).
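For instance, a minimal sketch of such an experiment might look like the following (assuming gensim 4.x and reusing the hypothetical SentenceCorpus and ArticleCorpus from the question; min_count and epochs are adjusted only because the toy corpus is tiny):

from gensim.models import Word2Vec

# Train one model per segmentation and compare nearest neighbours for a few probe words.
sentence_model = Word2Vec(SentenceCorpus, vector_size=50, min_count=1, epochs=50)
article_model = Word2Vec(ArticleCorpus, vector_size=50, min_count=1, epochs=50)

for probe in ["first", "sentence"]:
    print(probe, sentence_model.wv.most_similar(probe, topn=3))
    print(probe, article_model.wv.most_similar(probe, topn=3))

One practical difference to keep in mind: gensim treats each list item as one training unit and the context window does not cross item boundaries, so article-level input lets contexts span sentence breaks, while sentence-level input keeps contexts within single sentences.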
Correct answer by NBartley on February 16, 2021
For the former (a corpus split into sentences), gensim has the Word2Vec class; for the latter (treating each whole article as a document), there is Doc2Vec.
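As a rough sketch of the latter (again assuming gensim 4.x and reusing the hypothetical ArticleCorpus from the question), each article is wrapped in a TaggedDocument so the model learns one vector per article in addition to the word vectors:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per article; the integer tag identifies the article.
tagged = [TaggedDocument(words=article, tags=[i]) for i, article in enumerate(ArticleCorpus)]
doc_model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

print(doc_model.dv[0])  # learned vector for the first article
print(doc_model.infer_vector(["first", "sentence", "of", "the", "first", "article."]))  # vector for unseen text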
Answered by user13684 on February 16, 2021