Cross Validated Asked by Baktaawar on December 27, 2021
I have been doing ML for quite some time, and one aspect of class-imbalance problems has bothered me quite a lot.
In problems with an imbalanced dataset (one class is far more frequent than the other), there is a whole area of class-imbalance techniques to mitigate it: resampling, weighting classes in inverse proportion to their size during training, generating synthetic instances of the minority class (SMOTE), and so on.
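As a minimal sketch of two of these techniques (the helper names, the toy labels, and the 90/10 split are my own illustration, not anything from the thread), inverse-frequency class weighting and random oversampling can both be written in a few lines of plain Python:

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / class frequency, so that
    rare classes contribute more to the training loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until every
    class matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, cnt in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == cls]
        for _ in range(target - cnt):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out

y = [0] * 90 + [1] * 10
print(inverse_frequency_weights(y))   # minority class gets weight 5.0
X_bal, y_bal = random_oversample([[i] for i in range(100)], y)
print(Counter(y_bal))                 # both classes now have 90 rows
```

SMOTE differs from the oversampling above in that it interpolates between minority-class neighbours instead of duplicating rows, but the effect on class proportions is the same.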
But my problem is that we do all that to the training data, while real-world test data remains imbalanced. Shouldn't we leave the training data unmodified so that it still mimics real-world data?
Yes, I know how the above techniques help. My point is that this biases the data: if real-world data is going to contain less of the minority class, then in training we are biasing the data by making the algorithm see more of it than it would in real life.
What is the right approach here?
I'm not sure if this is an answer or not, but I'll throw in my two cents.
Real-world test data is imbalanced. Shouldn't we leave the training data unmodified so that it still mimics real-world data?
You're referring to the prevalence of classes in the real world. This is an important point to make when you're doing something like risk modelling for medical diagnoses (e.g. your risk of heart attack). If the prevalence of the positive class is low, your risk model should respect that. Resampling for the sake of having a class balance artificially increases baseline risk to 50%.
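A toy illustration of that last sentence (the numbers are made up, not from the answer): an intercept-only risk model simply predicts the base rate, so naively rebalancing the sample to 50/50 moves every predicted risk toward 50%.

```python
# Hypothetical screening data: 2% of patients have the condition.
y = [1] * 20 + [0] * 980

base_rate = sum(y) / len(y)
print(base_rate)           # 0.02 -- an intercept-only model predicts 2% risk

# Naive 50/50 rebalancing: duplicate the positives until classes match.
positives = [lab for lab in y if lab == 1]
y_balanced = y + positives * 48          # 20 + 20*48 = 980 positives
rebalanced_rate = sum(y_balanced) / len(y_balanced)
print(rebalanced_rate)     # 0.5 -- the same model now predicts 50% risk
```

There are known intercept corrections for undoing this shift after training, but if calibrated risk is what you need, it is simpler not to distort the prevalence in the first place.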
Classification is something different however. Frank Harrell writes that classification should really only be used when the class is quite obvious and there is high signal to noise (e.g. is this a picture of a dog or not). In that case, prevalence shouldn't really be an issue. You want your algorithm to be able to learn differences between classes, and in my own opinion, their prevalence in the real world is orthogonal to that goal.
So as with everything, the answer depends on what you're doing. If risk of an event occurring is important, and classes are rare, then resampling can turn a perfectly good model bad. However if you just want your computer to distinguish chihuahuas from blueberry muffins, then the prevalence in the real world of either is not important.
Answered by Demetri Pananos on December 27, 2021
In the real world, many imbalanced-class problems carry a heavy cost for misclassification. The minority class might be rare, but a single occurrence of it can have a very large impact. The minority class is often the whole point, something to avoid or to obtain, not some useless noise class.
This is enough to justify resampling: you want the algorithm not to misclassify the minority class. An algorithm trained on imbalanced data has less information about when it should classify an observation as the minority class. In the end, it will often just label everything as the majority class.
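That failure mode is easy to demonstrate (a toy sketch with made-up numbers, not data from the answer): a degenerate classifier that always predicts the majority class looks very accurate, yet never catches a single minority case.

```python
# Toy fraud data: 1% positives.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)        # "always predict majority" baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = tp / sum(y_true)

print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- misses every minority case
```

This is why accuracy alone is misleading under imbalance, and why resampling or class weights (or metrics like recall and precision) matter when the minority class is the point.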
My point is that this biases the data if real-world data is going to contain less of the minority class. In training we are biasing the data by making the algorithm see more of it than it would in real life.
The point of having the algorithm is its predictive ability. You want the algorithm to predict correctly; that's it.
Whether the algorithm sees the data exactly as it appears in real life is not the point. If it were, you would have to say goodbye to feature engineering as well.
p.s.:
We can stretch this and extrapolate to how humans handle imbalanced data. Humans also do a kind of resampling/weighting, by remembering rare but high-impact events more intensely than the boring things that happen every day. It balances out, so a person remembers both "the one thing that happened and changed my life" and "the thing I do every day".
Answered by Nuclear03020704 on December 27, 2021