Data Science Asked on May 9, 2021
When I use SMOTE-NC to oversample three classes of a 4-class classification problem, the precision, recall, and F1 metrics for the minority classes are still VERY low (~3%). I have 32 categorical and 30 continuous variables in my dataset. All the categorical variables have been converted to binary columns using one-hot encoding. Also, before the oversampling step, I impute all missing values using IterativeImputer.
Regarding the classifiers, I am using logistic regression, random forest, and XGBoost. May I have your thoughts on this? Any suggestions for oversampling a multiclass, highly imbalanced dataset?
Before going through the process of oversampling, always check whether the implementation of your algorithm supports assigning different weights to individual classes. Scikit-learn's RandomForestClassifier, for example, has a class_weight parameter for exactly this. I have found this approach to work better than over- or undersampling.
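As a minimal sketch of that suggestion (the dataset below is synthetic, generated only to illustrate the parameter):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical imbalanced 4-class problem (~85/7/5/3 % class split).
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=6,
                           weights=[0.85, 0.07, 0.05, 0.03], random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency,
# so mistakes on the rare classes cost more during training.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
```

LogisticRegression accepts the same `class_weight` parameter; for XGBoost's scikit-learn wrapper you can pass per-sample weights via `sample_weight` in `fit` to achieve a similar effect.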
Also, I have to add an obligatory caveat: if your minority classes have only very few samples, so that the characteristics of those classes are not well captured, there is little you can do except collect more data.
Answered by georg-un on May 9, 2021