
How to preprocess a big dataset with NLP for text classification

Data Science: Asked by gabriel garcia on August 14, 2021

TL;DR

I've never done NLP before and I feel like I'm not doing it the right way. I'd like to know whether I've really been doing things wrong from the beginning, or whether there's still hope of fixing the problems mentioned below.

Some basic info

I'm trying to do binary text classification for a university assignment, and I'm struggling with the classification because the NLP preprocessing isn't working well.

First of all, it's important to note that I need to keep efficiency in mind when designing things, because I'm working with very large datasets (>1M texts) that are loaded into memory.

These datasets contain news articles with title, summary, content, published_date, section, tags, and authors fields.

Also, it's worth mentioning that since this task is part of a learning process, I'm trying to build everything myself instead of using external libraries (which I use only for tedious or complex tasks).

Procedure

The basic procedure for the NLP preprocessing is:

  1. Feature extraction -> a str variable with the title, summary and content attributes joined into a single string
  2. Lemmatization -> the same str as input, but with its words lemmatized
  3. Stopword filtering
  4. Corpus generation -> a dict object with lemmatized words as keys and the index at which each was inserted into the dictionary as its value

After generating the corpus from all those samples, we can finally vectorize them safely (basically the same process as above, but without the corpus-building step).
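Roughly, the preprocessing looks like this (the lemmatizer and stopword list here are simplified placeholders, not my actual code):

    # Sketch of the preprocessing steps above; lemmatizer and stopwords are placeholders.
    STOPWORDS = {"the", "a", "an", "of", "and", "is", "in"}  # hypothetical list

    def lemmatize(token):
        # Placeholder: a real lemmatizer (e.g. WordNet- or spaCy-based) would go here.
        return token.lower()

    def preprocess(article, corpus):
        """Join the text fields, lemmatize, drop stopwords, and extend the corpus,
        where corpus maps each new lemma to its insertion index."""
        text = " ".join([article["title"], article["summary"], article["content"]])
        lemmas = [lemmatize(tok) for tok in text.split()]
        lemmas = [lem for lem in lemmas if lem not in STOPWORDS]
        for lemma in lemmas:
            if lemma not in corpus:
                corpus[lemma] = len(corpus)  # value = insertion index
        return lemmas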

As you might guess, I'm not strictly following the basic bag-of-words (BoW) idea, since I need to reduce memory consumption, and this raises several problems when trying to work with algorithms like DecisionTreeClassifier from scikit-learn.

Problems

Some of the problems I've observed so far are:

  • Vectors generated from these texts need to have the same dimension. Does padding them with zeroes make any sense?
  • Vectors used for prediction also need to have the same dimension as those used in training.
  • At prediction time, words that haven't been added to the corpus are ignored.
  • Also, the vectorization doesn't make much sense, since a vector like [0, 1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3] is different from [1, 0, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3] even though both contain the same information.

One Answer

Let me first clarify the general principle of classification with text data. Note that I'm assuming you're using a "traditional" method (like decision trees), as opposed to a Deep Learning (DL) method.

As you correctly understand, each individual text document (instance) has to be represented as a vector of features, each feature representing a word. But there is a crucial constraint: every feature/word must be at the same position in the vector for all the documents, because that's how the learning algorithm can find patterns across instances. For example, the decision tree algorithm might create a condition corresponding to "does the document contain the word 'cat'?", and the only way for the model to correctly detect whether this condition is satisfied is if the word 'cat' is consistently represented at index $i$ in the vector for every instance.

For the record, this is very similar to one-hot encoding: the variable "word" has many possible values, and each of them must be represented as a different feature.
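To make this concrete, here is a minimal sketch of a fixed-index bag-of-words encoding with a toy vocabulary of my own; every document is counted against the same word-to-index mapping:

    # Toy vocabulary, fixed once for ALL documents: 'cat' is always index 0.
    vocabulary = {"cat": 0, "dog": 1, "mat": 2, "sat": 3}

    def vectorize(tokens, vocabulary):
        """Count how often each vocabulary word occurs, at its fixed position."""
        vector = [0] * len(vocabulary)
        for token in tokens:
            index = vocabulary.get(token)
            if index is not None:  # tokens outside the vocabulary are ignored
                vector[index] += 1
        return vector

    print(vectorize(["the", "cat", "sat"], vocabulary))  # [1, 0, 0, 1]
    print(vectorize(["the", "dog", "sat"], vocabulary))  # [0, 1, 0, 1]

With this encoding, a condition like "does the document contain 'cat'?" always corresponds to the same position in every vector.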

This means that you cannot use a different index representation for every instance, as you currently do.

Vectors generated from these texts need to have the same dimension. Does padding them with zeroes make any sense?

As you have probably understood by now, no, it doesn't.

Vectors used for prediction also need to have the same dimension as those used in training.

Yes: not only must they have the same dimension, they must also have exactly the same features/words in the same order.

At prediction time, words that haven't been added to the corpus are ignored.

Absolutely: any out-of-vocabulary word (a word which doesn't appear in the training data) has to be ignored. It would be unusable anyway, since the model has no idea which class it is related to.
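You can see the same behaviour in scikit-learn's CountVectorizer, for example: the vocabulary is fixed at fit time, and unseen words are silently dropped when transforming new text (toy example of mine):

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    vectorizer.fit(["the cat sat on the mat"])  # vocabulary: cat, mat, on, sat, the

    # "dog" was never seen during fit, so it is silently ignored:
    print(vectorizer.transform(["the dog sat"]).toarray())  # [[0 0 0 1 1]]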

Also, the vectorization doesn't make much sense, since a vector like [0, 1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3] is different from [1, 0, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3] even though both contain the same information.

Indeed, you had the right intuition that there was a problem there; it's the same issue as above.

Now, of course, this brings you back to the problem of fitting these very long vectors in memory. In theory the vector length is the full vocabulary size, but in practice there are several good reasons not to keep all the words, and more precisely to remove the least frequent ones:

  • The least frequent words are difficult for the model to use. A word which appears only once (by the way, it's called a hapax legomenon, in case you want to impress people with fancy terms ;) ) doesn't help at all, because it might appear with a particular class purely by chance. Worse, it can cause overfitting: if the model creates a rule that classifies any document containing this word as class C (because in the training data 100% of the documents containing this word are class C, even though there's only one), and it turns out that the word has nothing specific to class C, the model will make errors. Statistically it's very risky to draw conclusions from a small sample, so the least frequent words often make "bad features".
  • You're going to like this one: texts in natural language follow a Zipf distribution. This means that in any text there's a small number of distinct words which appear frequently and a large number of distinct words which appear rarely. As a result, removing the least frequent words shrinks the vocabulary very quickly (because there are many rare words) but removes only a small proportion of the text (because most word occurrences are occurrences of frequent words). For example, removing the words which appear only once might cut the vocabulary size in half while reducing the text size by only 3%.

So practically what you need to do is this:

  1. Calculate the frequency of every distinct word across all the documents in the training data (and only the training data). Note that you only need to keep a single dict in memory, so it's doable. Sort it by frequency and store it in a file somewhere.
  2. Decide on a minimum frequency $N$ in order to obtain your reduced vocabulary, by removing all the words which have a frequency lower than $N$.
  3. Represent every document as a vector using only this predefined vocabulary (and fixed indexes, of course). Now you can train a model and evaluate it on a test set.
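Here is a sketch of these three steps in plain Python; train_docs below is a toy stand-in for your tokenized (lemmatized, stopword-filtered) training documents:

    from collections import Counter

    # Toy stand-in for the tokenized training documents.
    train_docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "ate", "fish"]]

    # 1. Word frequencies over the training data only (a single dict in memory).
    frequencies = Counter(token for doc in train_docs for token in doc)

    # 2. Keep only words with frequency >= N; this fixes one index per word.
    N = 2
    vocabulary = {word: index
                  for index, (word, freq) in enumerate(frequencies.most_common())
                  if freq >= N}  # most_common() is sorted, so kept indexes are 0..k-1

    # 3. Encode every document against this fixed, reduced vocabulary.
    def to_vector(doc):
        vector = [0] * len(vocabulary)
        for token in doc:
            index = vocabulary.get(token)
            if index is not None:  # out-of-vocabulary words are skipped
                vector[index] += 1
        return vector

    X_train = [to_vector(doc) for doc in train_docs]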

Note that you could try different values of $N$ (2, 3, 4, ...) and observe which one gives the best performance (it's not necessarily the lowest one, for the reasons mentioned above). If you do that, you should use a validation set distinct from the final test set, because evaluating several times on the same test set is a form of "cheating" (this is called data leakage).
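If you're using scikit-learn anyway, CountVectorizer's min_df parameter is a close stand-in for this frequency threshold (note it counts the number of documents containing a word rather than raw occurrences). A sketch of the tuning loop, with toy data standing in for your corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Toy stand-ins for your texts and binary labels.
    texts = ["cat sat mat", "dog ate bone", "cat ate fish", "dog sat log",
             "cat mat fish", "dog bone log", "cat fish sat", "dog log ate"]
    labels = [0, 1, 0, 1, 0, 1, 0, 1]

    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.25, random_state=0, stratify=labels)

    for min_freq in (1, 2, 3):
        vectorizer = CountVectorizer(min_df=min_freq)  # drop words in < min_freq documents
        X_tr = vectorizer.fit_transform(X_train)       # vocabulary from training data only
        X_va = vectorizer.transform(X_val)             # same fixed vocabulary for validation
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_train)
        print(min_freq, accuracy_score(y_val, model.predict(X_va)))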

Correct answer by Erwan on August 14, 2021
