Ordering of standardization, pca, and/or tfidf for neural network

Data Science Asked on June 10, 2021

I have 60k rows of text data. I have tokenized it into 55k columns. I am using a neural network to classify the data but have some questions about how to order my preprocessing steps. I have too much data for my hardware (doesn’t fit in memory/too slow) so I am using PCA to reduce dimensions.

  1. Obviously, I need to scale before PCA. I am currently standardizing the columns, but I am wondering if I can use tfidf instead of standardization. Some rows have 50k+ tokens while others have <1k tokens, so I am worried these rows have undue influence on the outcome of scaling, which will trickle down the pipeline. Is this a good or bad idea? Or should I apply tfidf and then standardize before PCA?

  2. Generally, neural nets prefer standardized data. After PCA, the first few columns have much greater magnitude than the rest because they capture so much of the variance. Should I standardize after PCA and before training? The reason for standardizing before training is so that no feature has a bigger influence on the model just because its scale is bigger, but isn't PCA telling me that the first few features are actually more important? FWIW, I've tried both and not scaling seems a little better.

  3. What about performing tfidf after PCA and before training? Again, rows with 50k+ tokens will prefer a network with orders-of-magnitude larger weights than rows with <1k tokens. Wouldn't it be hard for the network to find weights that work for both types of rows?

Diagram for clarity: data -> tokenize -> ?standardize/tfidf? -> PCA -> ?standardize/tfidf? -> neural net
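
To make the diagram concrete, here is a minimal sketch of one candidate ordering (tokenize -> tfidf -> standardize -> PCA -> neural net), assuming scikit-learn; `texts` and `y` are placeholders for my raw documents and labels, and the classifier is just a stand-in for the actual network:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier

    pipe = make_pipeline(
        TfidfVectorizer(),                           # tokenize + tfidf weighting in one step
        FunctionTransformer(lambda X: X.toarray()),  # PCA generally needs dense input; densifying 60k x 55k is the memory problem
        StandardScaler(),                            # standardize the columns before PCA
        PCA(n_components=300),                       # keep a few hundred components
        MLPClassifier(hidden_layer_sizes=(128,), max_iter=50),  # stand-in for the neural net
    )
    pipe.fit(texts, y)  # `texts`: raw documents, `y`: labels (placeholders)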

One Answer

I would go for this:

data -> tokenize -> tfidf* -> neural net

But in the tfidf vectorizer you can actually restrict the number of terms used, for example by requiring a minimum number of occurrences for each term and/or setting a maximum number of features, so that you only keep the terms with the highest importance according to tfidf.
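
A minimal sketch of that first recommendation, assuming scikit-learn (the cutoffs, the classifier, and its size are illustrative placeholders; `texts` and `y` stand for the raw documents and labels):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier

    # data -> tokenize -> tfidf* -> neural net, with the vocabulary restricted:
    # each term must occur in at least 5 documents, and only the 20,000 most
    # frequent terms are kept instead of all 55k token columns.
    tfidf_pipe = make_pipeline(
        TfidfVectorizer(min_df=5, max_features=20_000),
        MLPClassifier(hidden_layer_sizes=(128,), max_iter=50),
    )
    tfidf_pipe.fit(texts, y)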

If you want to reduce the number of features via some decomposition technique, PCA won't be adequate since the term-frequency matrix is sparse, so you could, for example, use NMF (non-negative matrix factorization) instead.

So:

data -> tokenize -> tfidf->NMF -> neural net

This time, restricting the vocabulary in tfidf is not necessary since you have the additional dimensionality-reduction step.
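
A minimal sketch of this second pipeline, again assuming scikit-learn (the number of components and the classifier are placeholders):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF
    from sklearn.neural_network import MLPClassifier

    # tfidf -> NMF -> neural net: NMF factorizes the sparse tfidf matrix directly,
    # so no densification and no vocabulary cap is needed.
    nmf_pipe = make_pipeline(
        TfidfVectorizer(),
        NMF(n_components=200, max_iter=300),
        MLPClassifier(hidden_layer_sizes=(128,), max_iter=50),
    )
    nmf_pipe.fit(texts, y)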

In the end, cross-validation metrics will guide you to the best strategy.
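
For example, assuming scikit-learn's cross_val_score and the two pipelines sketched above (`tfidf_pipe` and `nmf_pipe`), you could compare them on the same folds:

    from sklearn.model_selection import cross_val_score

    # Score each candidate on the same folds and keep whichever does better.
    for name, candidate in [("tfidf -> nn", tfidf_pipe), ("tfidf -> NMF -> nn", nmf_pipe)]:
        scores = cross_val_score(candidate, texts, y, cv=5, scoring="f1_macro")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")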

Correct answer by Julio Jesus on June 10, 2021
