Preprocessing a corpus in a pipeline before Gensim word2vec training

Data Science Asked by StraightUpBusta on April 19, 2021

I have a large compressed corpus, about 30 GB in .txt.gz format. In its raw form it can be passed to word2vec like this:

import gensim

# corpus is the path to (or an open file object for) the .txt.gz file
data = gensim.models.word2vec.LineSentence(corpus)

This creates an iterator over the lines of the corpus. The next step is training:

model = gensim.models.Word2Vec(data)

I’d like to lemmatize and POS-tag the corpus before training. I am planning to use the NLTK WordNetLemmatizer and the NLTK POS tagger.

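How should I do this in a pipeline, so that each line is preprocessed as it is streamed into training rather than loading the whole corpus at once?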
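To make the question concrete, here is roughly the shape I have in mind. It is only a sketch: the class name PreprocessedLineSentence and the helper make_nltk_transform are names I made up, the word_TAG output format is just one possible choice, and I am assuming whitespace tokenization, UTF-8 text, and that the required NLTK data (tagger model and WordNet) is already downloaded. I have not run it on the full corpus.

```python
import gzip


class PreprocessedLineSentence:
    """Hypothetical restartable iterable over a .txt.gz corpus.

    Reads one line at a time, splits it on whitespace, and yields the
    token list after passing it through `transform`.
    """

    def __init__(self, path, transform):
        self.path = path
        self.transform = transform

    def __iter__(self):
        # Re-opening the file on every call lets Word2Vec make several
        # passes (vocabulary scan, then each training epoch).
        with gzip.open(self.path, "rt", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield self.transform(tokens)


def make_nltk_transform():
    """Hypothetical helper: POS-tag, then lemmatize, each token list.

    NLTK is imported lazily here; the tagger and WordNet data must have
    been fetched with nltk.download() beforehand (the exact resource
    names depend on the NLTK version).
    """
    from nltk import pos_tag
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def penn_to_wordnet(tag):
        # Map Penn Treebank tags to the POS constants WordNet expects.
        if tag.startswith("J"):
            return wordnet.ADJ
        if tag.startswith("V"):
            return wordnet.VERB
        if tag.startswith("R"):
            return wordnet.ADV
        return wordnet.NOUN

    def transform(tokens):
        # Emit tokens as lemma_TAG (an assumed format, not a requirement).
        return [
            f"{lemmatizer.lemmatize(token.lower(), penn_to_wordnet(tag))}_{tag}"
            for token, tag in pos_tag(tokens)
        ]

    return transform


# Intended use (not executed here):
#   import gensim
#   data = PreprocessedLineSentence("corpus.txt.gz", make_nltk_transform())
#   model = gensim.models.Word2Vec(data)
```

My worry with this approach is that the tagging and lemmatizing would be repeated on every pass over the data, so I am also wondering whether it is better to preprocess once, write the result to a new .txt.gz file, and then use LineSentence on that file.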

gensim nlp pipelines python word2vec
