SMOTE-NC does not help to oversample my mixed continuous/categorical dataset

Data Science Asked on May 9, 2021

When I use SMOTE-NC to oversample three classes of a 4-class classification problem, the precision, recall, and F1 metrics for the minority classes are still VERY low (~3%). My dataset has 32 categorical and 30 continuous variables. All the categorical variables have been converted to binary columns using one-hot encoding. Also, before the oversampling step, I impute all missing values with IterativeImputer.

As for classifiers, I am using logistic regression, random forest, and XGBoost. May I have your thoughts on this? Any suggestions for oversampling a multiclass, highly imbalanced dataset?

One Answer

Before going through the process of oversampling, always check whether the implementation of your algorithm supports assigning different weights to individual classes. sklearn's RandomForestClassifier, for example, has a class_weight parameter for exactly that. I found this method to work better than over- or undersampling.
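A minimal sketch of that approach (the dataset is synthetic and the class proportions are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced 4-class problem (sizes and proportions are made up).
X, y = make_classification(
    n_samples=2000, n_classes=4, n_informative=6,
    weights=[0.85, 0.05, 0.05, 0.05], random_state=0,
)

# class_weight="balanced" reweights each class inversely to its frequency,
# so mistakes on minority classes cost more during training; no resampling
# of the data itself is needed.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
```

The same `class_weight` parameter exists on LogisticRegression, and XGBoost offers per-sample weights via `sample_weight` in `fit`.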

Also, I have to add an obligatory caveat: if your minority classes have so few samples that the characteristics of those classes are not well captured, there is little you can do except collect more data.

Answered by georg-un on May 9, 2021

