Data Science Asked on April 6, 2021
I am doing an NLP binary classification task, using BERT with a softmax layer on top of it. The network uses cross-entropy loss.
When the ratio of positive class to negative class is 1:1 or 1:2, the model classifies both classes well (accuracy for each class is around 0.92).
When the ratio is between 1:3 and 1:10, the model performs poorly, as expected. At 1:10, it classifies negative instances with 0.98 accuracy but positive instances with only 0.80 accuracy.
This behavior is expected: with a 1:10 ratio of positive to negative examples, the model tends to push most instances toward the negative class.
What is the recommended way to handle this kind of class imbalance problem in natural language processing specifically?
I have seen suggestions to change the loss function or to perform up-/down-sampling, but most of them target class imbalance in computer vision.
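(For concreteness, here is a minimal sketch of the "change the loss function" route using class-weighted cross-entropy with a Hugging Face BERT classifier in PyTorch. It is not from the original post: the checkpoint name, weight values, and toy batch are illustrative assumptions, and the 10x positive-class weight simply mirrors the 1:10 ratio described above.)

```python
# Minimal sketch (assumptions: PyTorch + Hugging Face transformers installed,
# binary labels 0 = negative, 1 = positive, imbalance roughly 1:10).
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # assumed checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Weight the cross-entropy loss inversely to class frequency:
# the rare positive class gets roughly 10x the weight of the negative class.
class_weights = torch.tensor([1.0, 10.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# One illustrative forward/backward step on a toy batch.
texts = ["an example negative sentence", "an example positive sentence"]
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

logits = model(**batch).logits   # shape: (batch_size, 2)
loss = loss_fn(logits, labels)   # weighted cross-entropy
loss.backward()
```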
Disclaimer: this answer might be disappointing ;)
In general, my advice would be to carefully analyze the errors that the model makes and try to make the model handle those cases better. This can involve many different strategies depending on the task and the data.
Overall, my old-school advice is not to rely too much on technical answers such as resampling methods. They can make sense sometimes, but they shouldn't be used as some kind of magical answer instead of careful analysis.
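(As an illustration of the per-class error analysis suggested above, here is a minimal sketch using scikit-learn. It is not from the original answer; y_true and y_pred are toy placeholders for your validation labels and the model's predictions.)

```python
# Sketch of a per-class error breakdown with scikit-learn.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]   # toy labels, minority positive class
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # toy model predictions

# Precision/recall/F1 per class shows where the minority class actually fails,
# which plain accuracy hides under heavy imbalance.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))

# Rows = true class, columns = predicted class: which error type dominates?
print(confusion_matrix(y_true, y_pred))
```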
Correct answer by Erwan on April 6, 2021