
Oversampling plus downsampling using SMOTE not working on random forests

Data Science Asked on December 18, 2020

I am trying to solve a classification problem on a highly imbalanced data set. I am using SMOTE to oversample the minority class and downsample the majority class. After creating a balanced data set, I applied a random forest model. But the prediction error for the minority class is extremely high, even with the balanced data set. What could be going wrong?

library(DMwR)
library(randomForest)

# SMOTE: perc.over oversamples the minority class, perc.under undersamples the majority
new.data <- SMOTE(Clicked ~ ., train, perc.over = 600, perc.under = 80)
table(new.data$Clicked)

rand.forest <- randomForest(Clicked ~ ., data = new.data, mtry = 7,
                            importance = TRUE, proximity = TRUE, ntree = 1000)

# class predictions on the held-out test set
yhat.rf <- predict(rand.forest, newdata = test)

# confusion matrix
table(yhat.rf, test$Clicked)

yhat.rf   0   1
      0 889  47
      1  57   7

4 Answers

Balancing your dataset does not guarantee an even prediction split. Imagine the case where your features cannot separate positive from negative examples at all. In that case, even if you balance the dataset, you will learn a decision boundary that essentially guesses randomly on each example. You would therefore expect your prediction error to mirror the majority/minority class distribution.

In this scenario you might not have strongly predictive features, or you may have insufficient data.
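
As a quick illustration, here is a minimal sketch on purely synthetic, uninformative features: even with a perfectly balanced training set, the forest can do no better than chance.

library(randomForest)

set.seed(42)
n <- 1000
# two pure-noise features and a perfectly balanced target
fake <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                   Clicked = factor(rep(c(0, 1), each = n / 2)))

rf <- randomForest(Clicked ~ ., data = fake, ntree = 200)
rf$confusion   # per-class out-of-bag error hovers around 50%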

Answered by jamesmf on December 18, 2020

SMOTE is not designed to work with severe data imbalance, especially if you have wide variation within the minority class. Try Borderline-SMOTE or SMOTEBoost.
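
A minimal sketch of Borderline-SMOTE in R, assuming the smotefamily package's BLSMOTE() interface (it takes a numeric feature frame and the target vector separately, and returns the resampled data with the target in a "class" column):

library(smotefamily)

# features and target are passed separately; features must be numeric
X <- train[, setdiff(names(train), "Clicked")]
bl <- BLSMOTE(X, train$Clicked, K = 5, C = 5, method = "type1")

new.data <- bl$data       # resampled set; the target is the "class" column
table(new.data$class)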

Answered by Bashar Haddad on December 18, 2020

From experience, I would also check the ROC curve and AUC. One might try under-sampling as well as other over-sampling methods. In R, you have this toolbox that provides different options.

You can also check this paper, which compares different methods and draws some insights on when over- or under-sampling is preferable.

However, I would agree with jamesmf to first check the discriminative power of your features.
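
For example, a sketch using the pROC package, scoring the forest with class probabilities rather than hard labels:

library(pROC)

# probability of the positive class instead of hard 0/1 predictions
prob.rf <- predict(rand.forest, newdata = test, type = "prob")[, "1"]

roc.rf <- roc(test$Clicked, prob.rf)
auc(roc.rf)    # threshold-independent summary of ranking quality
plot(roc.rf)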

Answered by glemaitre on December 18, 2020

In my experience, giving weights to observations (if the algorithm in use supports it) generally works better for highly imbalanced classification problems. Since you are using random forests, I would suggest trying that.
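
randomForest has no per-observation weight argument, but two related options exist: classwt sets class priors, and a stratified sampsize draws a balanced bootstrap for each tree. A sketch of the latter:

library(randomForest)

n.min <- min(table(train$Clicked))   # size of the minority class

rf.balanced <- randomForest(Clicked ~ ., data = train,
                            strata = train$Clicked,
                            sampsize = c(n.min, n.min),  # balanced sample per tree
                            ntree = 1000)
table(predict(rf.balanced, test), test$Clicked)

For true per-observation weights, the ranger package's case.weights argument is one alternative.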

Answered by trailblazer on December 18, 2020
