
Hashing trick for dimensionality reduction

Data Science Asked on August 8, 2021

I am building a model that uses TF-IDF NLP features in Spark MLlib. The HashingTF function in MLlib uses the ‘hashing trick’ to efficiently allocate terms to feature indices.

My question is: does the hashing trick work as an effective form of dimensionality reduction? Since I can choose the number of features produced by HashingTF, can I choose a relatively small number (say, only 512 or 1024 features) and be confident that the hashed representation will still retain meaningful structure in the data? I am using bi-grams in my TF-IDF, so the natural vocabulary size will be significantly larger than 1024.
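
For context, this is roughly what the pipeline looks like (a minimal PySpark sketch; the toy data, the column names, and the 1024 setting are just illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, NGram, HashingTF, IDF

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for my corpus; "id" / "text" are placeholder column names.
df = spark.createDataFrame(
    [(0, "the quick brown fox jumps over the lazy dog"),
     (1, "the lazy dog sleeps all day")],
    ["id", "text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
bigrams = NGram(n=2, inputCol="tokens", outputCol="bigrams")
# numFeatures is the number of hash buckets: each bi-gram is hashed and folded
# into this many indices, so a small value trades collisions for a smaller feature space.
hashing_tf = HashingTF(inputCol="bigrams", outputCol="tf", numFeatures=1024)
idf = IDF(inputCol="tf", outputCol="tfidf")

model = Pipeline(stages=[tokenizer, bigrams, hashing_tf, idf]).fit(df)
features = model.transform(df).select("id", "tfidf")
features.show(truncate=False)
```
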

Weinberger et al. (2009) suggest that hashing is an effective form of dimensionality reduction. However, I’m interested in whether fellow practitioners find it a good option in the real world.

Some additional context: I am finding that training is very slow, and I believe it’s due to the large number of columns (65,536). I have tried ChiSqSelector, but that step alone takes hours to execute, so it doesn’t improve things. The model I want to use can take a maximum of 4,096 features, so I can’t skip feature reduction entirely.
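
For completeness, this is approximately the ChiSqSelector step that is taking hours (again just a sketch: `features_with_labels` stands for my labelled training DataFrame, and the column names are placeholders):

```python
from pyspark.ml.feature import ChiSqSelector

# Keep the 4096 highest chi-squared-scoring hashed features.
# "tfidf" is the feature column from the pipeline above; "label" is my target column.
selector = ChiSqSelector(numTopFeatures=4096,
                         featuresCol="tfidf",
                         labelCol="label",
                         outputCol="selected")
selected = selector.fit(features_with_labels).transform(features_with_labels)
```
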
