Data Science Asked by StraightUpBusta on April 19, 2021
I have a large compressed corpus, about 30 GB in .txt.gz format. In its raw form it can be used as input to word2vec like this:
data = gensim.models.word2vec.LineSentence(corpus)
This creates an iterator over the lines of the corpus. The next step is training:
model = gensim.models.Word2Vec(data)
I’d like to lemmatize and POS-tag the corpus before training. I am planning to use NLTK's WordNetLemmatizer and NLTK's POS tagger.
How should I do this in a pipeline?
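One possible shape for such a pipeline, as a rough sketch rather than a definitive answer: Word2Vec makes several passes over its input, so the preprocessing can live in a restartable iterable that streams the gzip file line by line and transforms each sentence on the fly. The names `PreprocessedSentences`, `penn_to_wordnet` and `make_nltk_transform` below are made up for illustration, the `lemma_TAG` token format is just one assumed convention, and the NLTK part assumes the tagger and WordNet data have already been downloaded.

```python
import gzip


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[:1], "n")


class PreprocessedSentences:
    """Restartable iterable: each pass re-opens the .gz file and yields
    one transformed token list per non-empty line."""

    def __init__(self, path, transform):
        self.path = path
        self.transform = transform  # callable: list[str] -> list[str]

    def __iter__(self):
        with gzip.open(self.path, "rt", encoding="utf-8") as fh:
            for line in fh:
                tokens = line.split()
                if tokens:
                    yield self.transform(tokens)


def make_nltk_transform():
    """Build a transform that POS-tags and lemmatizes with NLTK.
    Imports are local so the streaming class works without NLTK installed."""
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def transform(tokens):
        tagged = nltk.pos_tag(tokens)  # [(word, penn_tag), ...]
        return [
            f"{lemmatizer.lemmatize(word.lower(), penn_to_wordnet(tag))}_{tag}"
            for word, tag in tagged
        ]

    return transform


# Hypothetical usage (not executed here):
# data = PreprocessedSentences("corpus.txt.gz", make_nltk_transform())
# model = gensim.models.Word2Vec(data)
```

With this layout the tagging and lemmatizing are repeated on every training pass. For a corpus of this size it may be cheaper to run the transform once, write the result to a new .txt.gz file, and then point LineSentence at that file.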