
Is it good practice to use SMOTE on an imbalanced data set when using a BERT model for text classification?

Asked by QMan5 on April 15, 2021

I have a question about SMOTE. If you have an imbalanced data set, is it correct to use SMOTE when you are using BERT? I believe I read somewhere that this is unnecessary because BERT takes class imbalance into account, but I can't find the article where I read it. From your own research or experience, would you say that oversampling with SMOTE (or some other algorithm) is useful when classifying with a BERT model, or would it be redundant?

One Answer

I don't know of any specific recommendation related to BERT, but my general advice is this:

  • Do not systematically use oversampling when the data is imbalanced, at least not before identifying specific performance issues caused by the imbalance. Many questions here on DataScienceSE are about fixing problems caused by applying oversampling blindly (and often incorrectly).
  • In general, resampling does not work well with text data, because linguistic diversity cannot be simulated by interpolating feature vectors; there is a high risk of obtaining an overfitted model (see the sketch below).
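To make the second point concrete, here is a minimal sketch of what SMOTE actually does when applied to text represented as vectors. The data is synthetic random vectors standing in for BERT sentence embeddings (the shapes, class sizes, and parameters are my own illustrative assumptions, not from any BERT pipeline):

```python
# Illustrative sketch only: random vectors stand in for BERT embeddings.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)

# Hypothetical embedded corpus: 95 majority-class vectors, 5 minority-class
# vectors, each 768-dimensional like a BERT [CLS] embedding.
X_major = rng.normal(loc=0.0, scale=1.0, size=(95, 768))
X_minor = rng.normal(loc=2.0, scale=1.0, size=(5, 768))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 95 + [1] * 5)

# SMOTE interpolates linearly between a minority sample and one of its
# nearest minority neighbours in feature space.
smote = SMOTE(k_neighbors=4, random_state=0)  # k_neighbors must be < 5 here
X_res, y_res = smote.fit_resample(X, y)

print(X.shape, X_res.shape)  # (100, 768) -> (190, 768)
```

The synthetic rows are valid points in embedding space, but they do not correspond to any real sentence, so they add no new linguistic diversity: the classifier effectively sees blends of the same five minority examples, which is why the overfitting risk remains.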

Correct answer by Erwan on April 15, 2021
