
Multilabel Tweet Classification

Cross Validated Asked by Vineet on November 12, 2021

I need some general advice and possible ideas.

The problem statement: we are given a tweet and have to assign the associated labels to it, such as generalized hate, support, oppose, refutation, allegation, and sarcasm.

The training data is ~6k tweets. However, there is a very high class imbalance: almost 90% of the label values are 0 and the rest are 1.
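To make the imbalance concrete, it may help to look at the positive rate of each label separately rather than one overall figure. A minimal sketch, assuming the labels are stored as one 0/1 row per tweet; the label names come from the list above, while the four rows and the helper `positive_rates` are made up for illustration:

```python
LABELS = ["generalized hate", "support", "oppose",
          "refutation", "allegation", "sarcasm"]

# Toy 0/1 label matrix: one row per tweet, one column per label (made-up data).
Y = [
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]

def positive_rates(rows, names):
    """Return {label name: fraction of rows where that label is 1}."""
    n = len(rows)
    return {name: sum(row[j] for row in rows) / n
            for j, name in enumerate(names)}

rates = positive_rates(Y, LABELS)
for name, rate in rates.items():
    print(f"{name:>16}: {rate:.2f}")
```

A label whose rate is at or near zero has almost no positive examples to learn from, which is worth knowing before comparing models.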

The approaches I have tried:

Classical BoW with generic tweet preprocessing, trained using SVM, XGB, Naive Bayes, etc. Accuracy is good (because of the class imbalance), but AUC is very poor, ~0.52.
More sophisticated techniques, such as LSTM and GRU with GloVe embeddings and BERT, perform even worse, with AUC < 0.49. In fact, the best of these classifiers predicts the same label for all the test data. (I also tried minority oversampling, but it did not improve performance either.)
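The gap between good accuracy and poor AUC can be reproduced with a small sketch. It assumes a single label column with 90% zeros and a degenerate classifier that gives every tweet the same score; the data and the helper names `accuracy` and `roc_auc` are illustrative, not taken from the actual experiments:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true 0/1 labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy label column: 90% zeros, 10% ones (made-up data).
y_true = [0] * 90 + [1] * 10

# Degenerate classifier: the same score and the same predicted label for every tweet.
constant_scores = [0.1] * 100
constant_preds = [0] * 100

acc = accuracy(y_true, constant_preds)
auc = roc_auc(y_true, constant_scores)
print(f"accuracy={acc:.2f}  auc={auc:.2f}")
```

Under these assumptions the constant classifier is right 90% of the time yet ranks positives no better than chance, which is consistent with an AUC close to 0.5.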

What I figured out is that the BERT vocabulary does not recognize most of the words and maps them to zero.
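One way to quantify this is to measure the share of tokens that fall outside the vocabulary. The sketch below is only a rough check under loud assumptions: `VOCAB` is a hypothetical toy vocabulary, and `simple_tokens` is a plain word-level splitter, not BERT's actual subword tokenizer; a real check would use the model's own tokenizer and vocabulary.

```python
import re

# Hypothetical toy vocabulary; a real check would load the model's own vocabulary.
VOCAB = {"i", "do", "not", "support", "this", "policy", "the", "is", "a"}

def simple_tokens(text):
    """Lowercase and split into word-like tokens, keeping #hashtags and @mentions whole."""
    return re.findall(r"[#@]?\w+", text.lower())

def oov_rate(texts, vocab):
    """Return (fraction of tokens not in vocab, list of those tokens)."""
    tokens = [tok for text in texts for tok in simple_tokens(text)]
    missing = [tok for tok in tokens if tok not in vocab]
    return (len(missing) / len(tokens) if tokens else 0.0), missing

# Made-up example tweets.
tweets = [
    "I do not support this policy",
    "@someone this is a #hoax lol",
]
rate, missing = oov_rate(tweets, VOCAB)
print(f"OOV rate: {rate:.2f}", missing)
```

Listing the missing tokens, rather than only the rate, shows whether the gaps are mostly mentions, hashtags, and slang or ordinary words lost in preprocessing.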

What other approaches should I try? Any leads are appreciated.

machine learning multilabel text mining unbalanced classes
