Data Science Asked on July 21, 2021
How to vectorize one-line text data? I have used TF-IDF with bigrams and trigrams, but I am not able to get good results.
I have purchase order (PO) descriptions, which are one-liners, and I need to classify them.
It is a multi-class, imbalanced problem, and I have a small training set of around 700 PO descriptions. There are 7 classes, and the class distribution is roughly exponential, with one class dominating.
My take is that TF-IDF should not work well, since with such short texts the term frequencies and document frequencies will be very small.
Also, can we write some user-defined functions to create the vectors? If yes, what should they be?
Please suggest some alternative approaches as well.
Check my answer to this question. Nowadays there are many pretrained embedders to choose from. They'll give you a fixed-size numerical vector of features. You don't even have to go the DNN route; XGBoost will work just fine.
Answered by Piotr Rarus on July 21, 2021
Using bigrams and trigrams is likely to generate a high number of features, but with a small dataset the traditional approach would be to reduce the number of features. You could start by removing the least frequent words/n-grams (e.g. fewer than 3 occurrences), and/or use feature selection with information gain (InfoGain). It might not be very accurate, but at least you avoid overfitting.
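To make the two steps above concrete, here is a minimal standard-library sketch, not the answerer's own code. It assumes whitespace tokenization, counts how many descriptions contain each n-gram (a stand-in for raw occurrence counts, which is nearly the same thing for one-liners), and scores each surviving n-gram by the information gain of its presence/absence split. The function names, the toy descriptions and the labels are all made up for illustration.

```python
import math
from collections import Counter


def ngrams(text, n_max=3):
    """Return the set of word 1..n_max-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def info_gain(docs_feats, labels, feat):
    """Information gain of the binary 'feat is present' split w.r.t. the labels."""
    present = [y for feats, y in zip(docs_feats, labels) if feat in feats]
    absent = [y for feats, y in zip(docs_feats, labels) if feat not in feats]
    gain = entropy(Counter(labels).values())
    for part in (present, absent):
        if part:
            gain -= len(part) / len(labels) * entropy(Counter(part).values())
    return gain


def select_features(texts, labels, min_count=3, top_k=10):
    """Drop n-grams seen in fewer than min_count texts, keep the top_k by info gain."""
    docs_feats = [ngrams(t) for t in texts]
    doc_freq = Counter(f for feats in docs_feats for f in feats)
    kept = [f for f, c in doc_freq.items() if c >= min_count]
    # Sort by descending gain; ties are broken alphabetically for determinism.
    kept.sort(key=lambda f: (-info_gain(docs_feats, labels, f), f))
    return kept[:top_k]


# Toy data, invented for this example only.
texts = [
    "laptop battery replacement", "laptop charger", "laptop docking station",
    "office chair", "office desk lamp", "office chair mat",
]
labels = ["it", "it", "it", "furniture", "furniture", "furniture"]
selected = select_features(texts, labels, min_count=3, top_k=2)
print(selected)
```

In practice a library such as scikit-learn offers equivalent building blocks (a minimum document frequency on the vectorizer and mutual-information-based feature selection), which would likely be preferable to hand-rolled code on real data.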
Answered by Erwan on July 21, 2021
Alternatively, you could use a pretrained embedder like word2vec or GloVe to vectorize your data into fixed-length vectors.
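One common way to turn word vectors into a fixed-length vector for a whole description is to average the vectors of its tokens. The sketch below illustrates only that idea: the 3-dimensional `TOY_EMBEDDINGS` table is a hypothetical stand-in for a real pretrained word2vec or GloVe table (which would be loaded from disk and typically has 50 to 300 dimensions), and the helper names are invented for this example.

```python
import math

# Hypothetical toy vectors standing in for a real pretrained embedding table.
TOY_EMBEDDINGS = {
    "laptop": [0.9, 0.1, 0.0],
    "charger": [0.7, 0.3, 0.0],
    "office": [0.1, 0.8, 0.1],
    "chair": [0.0, 0.9, 0.1],
}


def embed(text, table, dim):
    """Average the vectors of known tokens; out-of-vocabulary tokens are skipped."""
    vectors = [table[t] for t in text.lower().split() if t in table]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to the zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


po_vector = embed("Laptop charger XYZ-123", TOY_EMBEDDINGS, dim=3)
print(po_vector)
print(cosine(po_vector, embed("laptop", TOY_EMBEDDINGS, dim=3)))
print(cosine(po_vector, embed("office chair", TOY_EMBEDDINGS, dim=3)))
```

The resulting vectors have the same length regardless of how many words the description contains, so they can be fed to any standard classifier. Averaging ignores word order, which is usually an acceptable trade-off for short one-line descriptions.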
Answered by tehem on July 21, 2021