
Is it good practice to use SMOTE on an imbalanced data set when using a BERT model for text classification?

Asked by QMan5 on April 15, 2021

I have a question about SMOTE. If you have an imbalanced data set, is it correct to use SMOTE when you are using BERT? I believe I read somewhere that this is unnecessary because BERT takes class imbalance into account, but I can't find the article where I read it. From your own research or experience, would you say that oversampling with SMOTE (or some other algorithm) is useful when classifying with a BERT model, or would it be redundant?

One Answer

I don't know of any specific recommendation related to BERT, but my general advice is this:

  • Do not systematically use oversampling when the data is imbalanced, at least not before identifying specific performance issues caused by the imbalance. Many questions here on DataScienceSE are about fixing problems caused by applying oversampling blindly (and often incorrectly).
  • In general, resampling does not work well with text data, because linguistic diversity cannot be simulated by interpolating feature vectors; there is a high risk of obtaining an overfitted model (see the sketch below).
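To make the second point concrete, here is a minimal sketch of what SMOTE actually does when applied to text represented as vectors. The data is synthetic random vectors standing in for BERT sentence embeddings (the shapes, class sizes, and parameters are my own illustrative assumptions, not from any BERT pipeline):

```python
# Illustrative sketch only: random vectors stand in for BERT embeddings.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)

# Hypothetical embedded corpus: 95 majority-class vectors, 5 minority-class
# vectors, each 768-dimensional like a BERT [CLS] embedding.
X_major = rng.normal(loc=0.0, scale=1.0, size=(95, 768))
X_minor = rng.normal(loc=2.0, scale=1.0, size=(5, 768))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 95 + [1] * 5)

# SMOTE interpolates linearly between a minority sample and one of its
# nearest minority neighbours in feature space.
smote = SMOTE(k_neighbors=4, random_state=0)  # k_neighbors must be < 5 here
X_res, y_res = smote.fit_resample(X, y)

print(X.shape, X_res.shape)  # (100, 768) -> (190, 768)
```

The synthetic rows are valid points in embedding space, but they do not correspond to any real sentence, so they add no new linguistic diversity: the classifier effectively sees blends of the same five minority examples, which is why the overfitting risk remains.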

Correct answer by Erwan on April 15, 2021
