How to deal with imbalanced text data

Question

I am working on a problem where I have to classify products into multiple classes (more than one) based on product descriptions. For instance:

"Tresemme shampoo and conditioner - sulfate-free" = Personal Hygiene
"Lavender-scented handwash with moisturizer" = Personal Hygiene
"Doritos Ranch flavor 18 oz mega party pack" = Snacks
"Painting and Craft kit for adults above 18" = Art and Craft

However, my training dataset is highly imbalanced. A few classes have only 10 records, while one class has 3,000. There are 50,000 records overall.

Can anyone suggest any good techniques to deal with the imbalance in text data?

Thanks,
GD

BlackCurrant · Answer

I am working on the same problem too, and found the links below very useful for getting started with oversampling and undersampling:

https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/

https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
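
To make the idea in those links concrete, here is a minimal sketch of random oversampling and undersampling using only the Python standard library. The function name `random_resample`, the `target_size` parameter, and the two extra toy descriptions marked "made up" are my own illustrations, not something from the links or from your dataset. Treat it as a starting point rather than a tuned solution.

```python
import random
from collections import Counter, defaultdict


def random_resample(texts, labels, target_size, seed=0):
    """Resample every class to exactly target_size examples.

    Classes with fewer examples are oversampled (originals kept, random
    duplicates added); classes with more are undersampled (random subset).
    This is a hypothetical helper, not part of any library.
    """
    rng = random.Random(seed)

    # Group the descriptions by their class label.
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target_size:
            # Undersample: draw without replacement.
            chosen = rng.sample(items, target_size)
        else:
            # Oversample: keep all originals, add duplicates drawn with replacement.
            chosen = items + rng.choices(items, k=target_size - len(items))
        out_texts.extend(chosen)
        out_labels.extend([label] * target_size)
    return out_texts, out_labels


# Toy data: four descriptions from the question plus two made-up ones.
texts = [
    "Tresemme shampoo and conditioner - sulfate-free",
    "Lavender-scented handwash with moisturizer",
    "Mint toothpaste 6 oz",   # made up
    "Aloe vera body wash",    # made up
    "Doritos Ranch flavor 18 oz mega party pack",
    "Painting and Craft kit for adults above 18",
]
labels = ["Personal Hygiene"] * 4 + ["Snacks", "Art and Craft"]

balanced_texts, balanced_labels = random_resample(texts, labels, target_size=2)
print(Counter(balanced_labels))
```

Resample only the training split, after you have set aside validation and test data. Otherwise duplicated records can end up on both sides of the split and inflate your scores. With classes as small as 10 records, plain duplication adds no new information, so it is worth comparing the result against a model trained on the original data.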
