Non-uniform class occurrences in input data for a classification task - how to tackle it?

Data Science Asked by Mikołaj Wróblewski on December 5, 2020

I gathered political articles for my thesis, and now I want to be able to classify a given text. However, the class distribution is heavily skewed:

Class 1: 964 docs
Class 2: 37,020
Class 3: 640
Class 4: 2,675
Class 5: 793
Class 6: 23,160
Class 7: 2,665

Such skewed data is obviously going to favor classes 2 and 6. I thought about amplifying the error (the difference at the last layer) for classes with fewer observations. Is that worth a shot, or will it just cause overfitting on those classes? Unfortunately I can't scrape more data; the websites with the articles don't have any more (at least for now). Data augmentation is not possible either.
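
To make the weighting idea concrete, here is a minimal sketch of one common way to do it: per-class weights inversely proportional to class frequency. This is my assumption about what "amplifying the error for rare classes" would look like in practice, not a tested solution for this dataset. The helper name `balanced_class_weights` is hypothetical; the formula `total / (n_classes * count)` is the usual "balanced" heuristic.

```python
# Class counts copied from the list above.
counts = {1: 964, 2: 37020, 3: 640, 4: 2675, 5: 793, 6: 23160, 7: 2665}


def balanced_class_weights(counts):
    """Return inverse-frequency weights: total / (n_classes * count).

    Hypothetical helper for illustration. Rare classes get weights
    above 1, frequent classes get weights below 1.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(counts)

# Print the weights so the spread between rare and frequent classes is visible.
for label, w in sorted(weights.items()):
    print(f"Class {label}: weight {w:.3f}")
```

Weights like these would typically be passed to a weighted loss, for example the `weight` argument of `torch.nn.CrossEntropyLoss` or `class_weight` in scikit-learn classifiers. With this much imbalance the largest weight is dozens of times the smallest, so whether it causes overfitting on the small classes is something I would check on a held-out set with per-class metrics.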

dataset multiclass classification nlp
