Data Science Asked by gabriel garcia on August 14, 2021
I've never done NLP before and I feel like I'm not doing it the right way. I'd like to know whether I've really been doing things wrong from the beginning, or whether there's still hope to fix the problems mentioned below.
I'm trying to do binary text classification for a university assignment, and I'm struggling with the classification because the NLP preprocessing isn't going well.
First of all, it's important to note that I need to keep efficiency in mind when designing things, because I'm working with very large datasets (>1M texts) that are loaded into memory.
These datasets contain data about news articles, with attributes such as title, summary, content, published_date, section, tags, authors, ...
Also, it's important to mention that, since this task is part of a learning process, I'm trying to build everything myself instead of using external libraries (except for tedious or complex tasks).
The basic procedure for the NLP preprocessing is:
- The title, summary and content attributes are joined into a single string.
- A dict object is built with the lemmatized words as keys and, as values, the index at which each word was inserted into the dictionary.
- After generating the corpus with all those samples, we can finally safely vectorize them (which is basically the same process as above, but without the corpus-building step).
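To make that concrete, here is a rough sketch of the kind of vectorization I currently have (simplified: whitespace tokenization and a placeholder lemmatizer stand in for the real steps):

```python
corpus_index = {}  # lemma -> index at which it was first inserted

def lemmatize(token):
    # Placeholder for the real lemmatizer.
    return token.lower()

def build_corpus(texts):
    for text in texts:
        for token in text.split():
            lemma = lemmatize(token)
            if lemma not in corpus_index:
                corpus_index[lemma] = len(corpus_index)

def vectorize(text):
    # One entry per token, holding the index of its lemma in the corpus dict.
    return [corpus_index[lemmatize(t)] for t in text.split() if lemmatize(t) in corpus_index]
```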
As you might guess, I'm not strictly following the basic bag-of-words (BOW) idea, since I need to reduce memory consumption, and this raises a few problems when trying to work with ML algorithms like DecisionTreeClassifier from scikit-learn.
Some of the problems I've observed so far are:
- Vectors generated from those texts need to have the same dimension. Does padding them with zeroes make any sense?
- Vectors used for prediction also need to have the same dimension as those from training.
- At prediction time, words that haven't been added to the corpus are ignored.
- The vectorization doesn't make much sense, since the vectors look like [0, 1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3] and this is different from [1, 0, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3] even though they both contain the same information.

Let me first clarify the general principle of classification with text data. Note that I'm assuming you're using a "traditional" method (like decision trees), as opposed to a Deep Learning (DL) method.
As you correctly understand, each individual text document (instance) has to be represented as a vector of features, each feature representing a word. But there is a crucial constraint: every feature/word must be at the same position in the vector for all the documents. This is because that's how the learning algorithm can find patterns across instances. For example, the decision tree algorithm might create a condition corresponding to "does the document contain the word 'cat'?", and the only way for the model to correctly detect whether this condition is satisfied is if the word 'cat' is consistently represented at index $i$ in the vector for every instance.
For the record, this is very similar to one-hot encoding: the variable "word" has many possible values, and each of them must be represented as a different feature.
This means that you cannot use a different index representation for every instance, as you currently do.
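To illustrate (a toy sketch with a hypothetical four-word vocabulary, not your actual pipeline): with a fixed vocabulary shared by all documents, every document becomes a vector of the same length, and a given word always occupies the same position.

```python
# Toy fixed vocabulary shared by all documents (hypothetical example).
vocab = {"cat": 0, "dog": 1, "runs": 2, "sleeps": 3}

def bow_vector(text, vocab):
    """Count-based bag-of-words over a fixed vocabulary: same length and order for every document."""
    vec = [0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:          # words outside the vocabulary are simply ignored
            vec[vocab[token]] += 1
    return vec

print(bow_vector("the cat runs", vocab))    # [1, 0, 1, 0]
print(bow_vector("the dog sleeps", vocab))  # [0, 1, 0, 1]
# 'cat' is always feature 0, so "does the document contain 'cat'?" is a well-defined test for the tree.
```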
"Vectors generated from those texts need to have the same dimension. Does padding them with zeroes make any sense?"
As you have probably understood by now, no, it doesn't.
"Vectors used for prediction also need to have the same dimension as those from training."
Yes, they must not only have the same dimension but also contain the exact same features/words in the same order.
"At prediction time, words that haven't been added to the corpus are ignored."
Absolutely, any out-of-vocabulary word (a word which doesn't appear in the training data) has to be ignored. It would be unusable anyway, since the model has no idea which class it is related to.
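As a point of comparison (since you're building things yourself, this is only to check the expected behaviour, not a suggestion to switch libraries): scikit-learn's CountVectorizer works exactly this way, learning the vocabulary on the training data and silently dropping unseen words at prediction time.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the cat runs fast", "the dog sleeps"]  # toy training documents
test_texts = ["a zebra runs"]                          # 'zebra' never appears in training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary and vectorizes the training docs
X_test = vectorizer.transform(test_texts)        # reuses the same vocabulary; 'zebra' is ignored

print(vectorizer.get_feature_names_out())  # fixed feature order shared by train and test
print(X_train.shape, X_test.shape)         # same number of columns in both
```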
"Also, the vectorization doesn't make much sense, since the vectors look like [0, 1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3] and this is different from [1, 0, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3] even though they both contain the same information."
Indeed, you had the right intuition that there was a problem there; it's the same issue as above.
Now, of course, this brings you back to the problem of fitting these very long vectors in memory. In theory the vector length is the full vocabulary size, but in practice there are several good reasons not to keep all the words, more precisely to remove the least frequent ones: rare words occur too few times for the model to learn any reliable pattern from them (keeping them mostly adds noise and overfitting), and since word frequencies follow a long-tailed distribution, these rare words make up the vast majority of the vocabulary, so removing them shrinks the vectors dramatically.
So practically what you need to do is this:
- Count the frequency of every word across the whole training data. Even with a large corpus, the vocabulary fits in a dict in memory, so it's doable. Sort it by frequency and store it somewhere in a file.
- Choose a minimum frequency $N$ and keep only the words which appear at least $N$ times; these words, in a fixed order, form the vocabulary used to build the vector for every document (training and test alike).

Note that you could try different values of $N$ (2, 3, 4, ...) and observe which one gives the best performance (it's not necessarily the lowest one, for the reasons mentioned above). If you do that, you should normally use a validation set distinct from the final test set, because evaluating several times on the test set is like "cheating" (this is called data leakage).
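Here is a minimal sketch of that recipe (hypothetical toy data, with a plain whitespace tokenizer standing in for your lemmatization step):

```python
from collections import Counter

# Hypothetical toy data; in practice these are your >1M article texts.
train_texts = ["the cat runs fast", "the cat sleeps", "a dog runs", "the dog runs fast"]
val_texts = ["the cat runs"]

def build_vocabulary(texts, min_freq):
    """Count word frequencies over the training data and keep words seen at least min_freq times."""
    freq = Counter(token for text in texts for token in text.lower().split())
    kept = [word for word, count in freq.most_common() if count >= min_freq]  # most frequent first
    return {word: idx for idx, word in enumerate(kept)}

def vectorize(text, vocab):
    """Fixed-length count vector; out-of-vocabulary words are ignored."""
    vec = [0] * len(vocab)
    for token in text.lower().split():
        idx = vocab.get(token)
        if idx is not None:
            vec[idx] += 1
    return vec

# Try several thresholds and compare them on a validation set (never on the final test set).
for n in (2, 3, 4):
    vocab = build_vocabulary(train_texts, min_freq=n)
    X_train = [vectorize(t, vocab) for t in train_texts]
    X_val = [vectorize(t, vocab) for t in val_texts]
    print(n, len(vocab))  # ...train the classifier on X_train, evaluate it on X_val, keep the best N
```

Sorting the vocabulary by frequency also gives you a stable feature order that you can store in a file once and reload at prediction time, so training and prediction are guaranteed to use the same vector layout.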
Correct answer by Erwan on August 14, 2021