TransWikia.com

Preferred approaches for imbalanced data

Data Science Asked on March 31, 2021

I am building a binary classification model with an imbalanced target variable (13% class 1 vs. 87% class 0). I am considering the following three options to handle the data imbalance:

  1. Option 1: Create a balanced training dataset with a 50% / 50% split of the target variable.
  2. Option 2: Sample the dataset as-is (i.e., 87% / 13% split) and use upsampling methods (e.g., SMOTE) to balance the target variable to a 50% / 50% split.
  3. Option 3: Use learning methods with appropriate hyperparameters to account for the data imbalance, for example: scale_pos_weight in XGBoost, class_weight in LGBMClassifier, class_weight in RandomForestClassifier.

Assuming I have enough available data, is the first option always the best approach?
What are the pros and cons of each of the three methods, especially the 2nd and 3rd options? (I assume it is always preferable to avoid creating new synthetic samples.)
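For reference, the weights that Option 3 relies on can be derived directly from the class counts. The sketch below is only an illustration on toy labels that mirror the 87% / 13% split; the helper names `balanced_class_weights` and `negative_to_positive_ratio` are made up for this example. It uses the common "balanced" heuristic, n_samples / (n_classes * class_count), and the negative-to-positive count ratio that is typically used as a starting value for scale_pos_weight.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def negative_to_positive_ratio(labels, positive=1):
    """Count of negative examples divided by count of positive examples."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


# Toy labels (an assumption): 870 examples of class 0, 130 of class 1.
y = [0] * 870 + [1] * 130

weights = balanced_class_weights(y)   # per-class weights, rarer class gets more
spw = negative_to_positive_ratio(y)   # candidate value for scale_pos_weight

print(weights)
print(spw)
```

These values are starting points rather than fixed answers; they would normally be tuned like any other hyperparameter, using a metric that suits the imbalance.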

One Answer

I think it mostly depends on your dataset. Are you dealing with text, images, or something else? Your features will tell you which option fits your case best. In my experience, in most cases Options 1 and 2 depend not only on the dataset and the predictive power of your features; they also need to be judged by whether your model shows high bias or high variance, which will tell you whether they are helping. You need to run some experiments, or know your dataset well, to find out whether adding or removing data will affect your model's performance.

What I would suggest is to use upsampling and downsampling together to balance your dataset in a reasonably fair way. In this case (87% class 0 and 13% class 1), upsample class 1 and downsample class 0. How much to upsample and how much to downsample is your choice and depends on how you define fairness for your dataset, and that definition can differ from case to case.
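A minimal sketch of combining the two directions is shown below, assuming plain random resampling (duplicating minority rows with replacement, not SMOTE) and toy labels with the same 87% / 13% split. The function name `resample_both` and the target counts of 400 per class are arbitrary choices for illustration, not a recommendation.

```python
import random


def resample_both(labels, majority_keep, minority_target, minority=1, seed=0):
    """Return row indices after downsampling the majority class (without
    replacement) and upsampling the minority class (with replacement)."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]

    # Downsample: keep a random subset of the majority rows.
    kept_majority = rng.sample(majority_idx, majority_keep)

    # Upsample: keep every original minority row, then add random duplicates
    # until the target count is reached.
    n_extra = max(0, minority_target - len(minority_idx))
    extra_minority = rng.choices(minority_idx, k=n_extra)

    idx = kept_majority + minority_idx + extra_minority
    rng.shuffle(idx)
    return idx


# Toy labels (an assumption): 870 examples of class 0, 130 of class 1.
y = [0] * 870 + [1] * 130

idx = resample_both(y, majority_keep=400, minority_target=400)
resampled = [y[i] for i in idx]

print(len(resampled), sum(resampled))
```

Resampling like this should be applied to the training split only, so that validation and test data keep the original class distribution.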

Answered by Hamed on March 31, 2021
