Text preprocessing on corpus in pipeline before Gensim word2vec training

Data Science Asked by StraightUpBusta on April 19, 2021

I have a large compressed corpus, about 30 GB, in .txt.gz format. In its raw form it can be passed to word2vec like this:

import gensim

data = gensim.models.word2vec.LineSentence(corpus)

This creates an iterable that yields each line of the corpus as a list of whitespace-separated tokens. The next step is training:

model = gensim.models.Word2Vec(data)

I’d like to lemmatize and POS-tag the corpus before training. I plan to use NLTK's WordNetLemmatizer and POS tagger.

How should I do this in a pipeline?
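
To make the goal concrete, here is a minimal sketch of the kind of pipeline I have in mind, not a tested solution. It assumes the preprocessing can sit inside a restartable iterable that takes the place of LineSentence. The names PreprocessedSentences, make_nltk_transform and train are hypothetical, the lemma_TAG output format is just one possible choice, and the NLTK part assumes the WordNet and POS-tagger data have already been fetched with nltk.download.

```python
import gzip


class PreprocessedSentences:
    """Restartable iterable: streams a .txt.gz file, one token list per line."""

    def __init__(self, path, transform):
        self.path = path            # path to the gzip-compressed corpus
        self.transform = transform  # callable: list of tokens -> list of tokens

    def __iter__(self):
        # The file is reopened on every call, so Word2Vec can make several
        # passes (a vocabulary scan plus one pass per training epoch).
        with gzip.open(self.path, "rt", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield self.transform(tokens)


def make_nltk_transform():
    """Build a transform that POS-tags and lemmatizes (hypothetical helper)."""
    # Imported lazily so the class above works without NLTK installed.
    from nltk import pos_tag
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # First letter of a Penn Treebank tag -> WordNet POS constant.
    tag_map = {"J": wordnet.ADJ, "V": wordnet.VERB,
               "N": wordnet.NOUN, "R": wordnet.ADV}

    def transform(tokens):
        out = []
        for word, tag in pos_tag(tokens):
            wn_pos = tag_map.get(tag[0], wordnet.NOUN)  # default to noun
            lemma = lemmatizer.lemmatize(word.lower(), pos=wn_pos)
            out.append(f"{lemma}_{tag}")  # assumed format: lemma plus tag
        return out

    return transform


def train(path):
    """Train word2vec on the preprocessed stream (path is a placeholder)."""
    import gensim

    sentences = PreprocessedSentences(path, make_nltk_transform())
    return gensim.models.Word2Vec(sentences)
```

My concern with this approach is that the tagging and lemmatizing would be repeated on every pass over the data. For a 30 GB corpus it may be better to run the transform once, write the result to a new .txt.gz file, and train on that with LineSentence.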
