TransWikia.com

Vectorize one-line text data

Data Science Asked on July 21, 2021

How can I vectorize one-line text data? I have used TF-IDF with bigrams and trigrams, but I am not getting good results.
I have purchase order (PO) descriptions, which are one-liners, and I need to classify them.
It is an imbalanced multi-class problem, and I have only a small training set of around 700 PO descriptions. There are 7 classes, and the class distribution is roughly exponential, with one class dominating.
My take is that TF-IDF should not work well here, since the term frequencies and the IDF values will be very small.
Also, can we write user-defined functions to create the vectors? If so, what should they look like?

Please suggest some alternative approaches as well.

3 Answers

Check my answer to this question. Nowadays there are many pretrained embedders to choose from. They will give you a fixed-size numerical feature vector. You don't even have to go the DNN route; XGBoost will work just fine.

Answered by Piotr Rarus on July 21, 2021

Using bigrams and trigrams is likely to generate a large number of features, but with a small dataset the traditional approach is to reduce the number of features. You could start by removing the least frequent words/n-grams (e.g. fewer than 3 occurrences), and/or use feature selection with information gain (InfoGain). It might not be very accurate, but at least you avoid overfitting.
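
A minimal sketch of that idea, using only the Python standard library. The function names, the toy descriptions and the choice to count document frequency (rather than raw occurrences) are my own assumptions for illustration, not something taken from the question's data:

```python
import math
from collections import Counter


def ngrams(text, max_n=3):
    """Return the set of word 1..max_n-grams found in a lowercased text."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def select_features(texts, labels, min_count=3, top_k=20):
    """Drop n-grams seen in fewer than min_count documents,
    then rank the survivors by information gain w.r.t. the labels."""
    docs = [ngrams(t) for t in texts]
    doc_freq = Counter(g for d in docs for g in d)
    kept = [g for g, c in doc_freq.items() if c >= min_count]

    base = entropy(labels)
    scores = {}
    for g in kept:
        with_g = [y for d, y in zip(docs, labels) if g in d]
        without_g = [y for d, y in zip(docs, labels) if g not in d]
        p = len(with_g) / len(labels)
        # Information gain = H(class) - H(class | n-gram present/absent)
        scores[g] = base - p * entropy(with_g) - (1 - p) * entropy(without_g)

    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:top_k], scores


# Toy, made-up PO descriptions (assumption, for illustration only).
texts = [
    "steel pipe 2 inch",
    "steel pipe 4 inch",
    "steel pipe fittings",
    "office chair black",
    "office chair mesh",
    "office chair leather",
]
labels = ["hardware"] * 3 + ["furniture"] * 3

selected, scores = select_features(texts, labels, min_count=3, top_k=10)
print(selected)
```

In practice you would probably reach for a library rather than hand-rolling this; scikit-learn's `TfidfVectorizer` has a `min_df` parameter for the frequency cut-off, and `mutual_info_classif` provides a comparable score for feature selection.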

Answered by Erwan on July 21, 2021

You could alternatively use a pretrained embedder such as word2vec or GloVe to vectorize your data into fixed-length vectors.
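
As a rough sketch of how word vectors become one fixed-length vector per description: the snippet below averages the vectors of the known words. The tiny `toy_vectors` table is made up and merely stands in for a real word2vec or GloVe lookup, and averaging is just one common pooling choice that I am assuming here:

```python
def embed(text, vectors, dim):
    """Average the vectors of the known words in `text`.

    Returns a zero vector of length `dim` if no word is in the table.
    """
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the vectors dimension by dimension.
    return [sum(col) / len(known) for col in zip(*known)]


# Hypothetical 3-dimensional "pretrained" vectors (illustration only;
# real word2vec/GloVe vectors typically have 50 to 300 dimensions).
toy_vectors = {
    "steel": [1.0, 0.0, 0.0],
    "pipe": [0.0, 1.0, 0.0],
    "chair": [0.0, 0.0, 1.0],
}

vec = embed("Steel pipe 2 inch", toy_vectors, dim=3)
print(vec)
```

Every description, whatever its length, ends up as a vector of the same size, which can then be fed to any standard classifier.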

Answered by tehem on July 21, 2021
