SMOTE-NC does not help to oversample my mixed continuous/categorical dataset

Data Science Asked on May 9, 2021

When I use SMOTE-NC to oversample three classes of a 4-class classification problem, the precision, recall, and F1 metrics for the minority classes are still VERY low (~3%). My dataset has 32 categorical and 30 continuous variables. All the categorical variables have been converted to binary columns using one-hot encoding. Also, before the oversampling step, I impute all missing values with IterativeImputer.

As for classifiers, I am using logistic regression, random forest, and XGBoost. May I have your thoughts on this? Any suggestions for oversampling a multiclass, highly imbalanced dataset?

One Answer

Before going through the process of oversampling, always check whether the implementation of your algorithm supports assigning different weights to individual classes. sklearn's RandomForestClassifier, for example, has a class_weight parameter for exactly that. I found this method to work better than over- or undersampling.
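A minimal sketch of that approach (the dataset is synthetic and the class proportions are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced 4-class problem (sizes and proportions are made up).
X, y = make_classification(
    n_samples=2000, n_classes=4, n_informative=6,
    weights=[0.85, 0.05, 0.05, 0.05], random_state=0,
)

# class_weight="balanced" reweights each class inversely to its frequency,
# so mistakes on minority classes cost more during training; no resampling
# of the data itself is needed.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
```

The same `class_weight` parameter exists on LogisticRegression, and XGBoost offers per-sample weights via `sample_weight` in `fit`.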

Also, I have to add an obligatory caveat: if your minority classes have so few samples that the characteristics of those classes are not well captured, there is little you can do except collect more data.

Answered by georg-un on May 9, 2021

