Data Science Asked on January 9, 2021
As far as I can tell, broadly speaking, there are three ways of dealing with binary imbalanced datasets:
Option 1:
Option 2:
Option 3:
scale_pos_weight (https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html).
My main question is whether I am interpreting the options correctly. Is there any conceptual mistake in what I'm saying? Is it appropriate to use stratified k-fold in all three cases when dealing with imbalance? Is it unnecessary to apply any resampling when using XGBoost and tuning scale_pos_weight? When some resampling is applied (Options 1 and 3), does it make sense to use a "traditional" metric rather than an "alternative" one? In general, should resampling be applied separately to the training and test sets? Etc.
Also, it would be nice if you have any good reference to SMOTE and ROSE, regarding how they work, how to apply them and how to use them with python.
Your summary is quite good, but I'm not comfortable dividing the broad discussion into those three more or less sharply separated roads. In practice, though, a technique similar to one of those is often chosen. Just let me underline something about them:
The Area Under the Precision-Recall Curve has been shown to be slightly better than the ROC AUC, but don't expect miracles: they are quite similar in practical situations. In fact, how well the metric you choose (ROC AUC, Precision-Recall AUC or similar) works, intended as its effect on the test set, depends strongly on the problem you are trying to solve.
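For reference, both areas are easy to compute; here is a minimal sketch, assuming scikit-learn and a synthetic imbalanced dataset (the numbers are purely illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic 95:5 imbalanced problem, just to illustrate the two metrics.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

print("ROC AUC:", roc_auc_score(y_te, scores))
print("Precision-Recall AUC (average precision):", average_precision_score(y_te, scores))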
The XGBoost scale_pos_weight option helps, but it does not work miracles. If you have already chosen to use XGBoost, then activate it; but don't choose XGBoost just because it has this option available.
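As a minimal sketch of how it is typically set (the XGBoost tutorial linked in the question suggests the ratio of negative to positive instances; the dataset here is synthetic and recent xgboost versions are assumed):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

neg, pos = np.bincount(y)  # counts of the negative (0) and positive (1) class
model = xgb.XGBClassifier(
    scale_pos_weight=neg / pos,  # up-weights errors on the minority (positive) class
    eval_metric="aucpr",         # Precision-Recall AUC is a sensible metric under imbalance
)
model.fit(X, y)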
Also, consider that:
In real-world applications your samples reflect the biases of the system that collected them. This means that if you have a small imbalance (up to 1:20 or even 1:100), it is better to avoid any balancing technique and keep the imbalance as it is. This will probably yield a better model in production. For the same reason, non-stratified k-fold sometimes gives better models than the stratified version.
Sophisticated synthetic-sample generation techniques such as SMOTE often miss their target. This is because the problem you are trying to solve can be so complex that SMOTE (or similar techniques) is simply unable to create decent synthetic samples. In those cases it is better to just use undersampling or (better) oversampling. Indeed, simply duplicating the samples of the minority class is an astonishingly good technique in terms of performance.
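To make that last point concrete, here is a minimal sketch, assuming the imbalanced-learn (imblearn) package: plain random over-sampling, which duplicates minority samples, next to SMOTE, which interpolates new synthetic ones.

from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Duplicating minority-class samples (the simple but effective baseline).
X_dup, y_dup = RandomOverSampler(random_state=0).fit_resample(X, y)
print("duplicated:", Counter(y_dup))

# SMOTE creates new synthetic minority samples by interpolating between neighbours.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_sm))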
Answered by Vincenzo Lavorini on January 9, 2021
An imbalanced class means the count of one class is very low compared to the count of the other class, so the model has little opportunity to learn the minority class.
We have the following options to handle the issue. The key goal is to reduce the fog created by the majority class and let the model see the minority class too:
Weighted Class - This instructs the learning process not to treat the classes equally but to use the specified ratio, e.g. the loss is 10 times larger if the minority class is misclassified (assuming a 10:1 weight). This is just one approach; weights can be used in other ways too (see the sketch after this list).
Over-sampling - Creating extra minority-class data (duplicated or synthetic) from the available data.
Under-sampling - Sampling the majority class down to a smaller number to improve the ratio.
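A minimal sketch of the weighted-class idea, assuming scikit-learn (the 10:1 weight mirrors the example above and is purely illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Misclassifying class 1 (the minority) now costs 10 times more in the loss.
clf = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X, y)

# class_weight="balanced" derives the weights from the class frequencies instead.
clf_auto = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)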
Metrics - I think the only thing you should be mindful of is that combined accuracy will not work. You may simply measure per-class accuracy, or use Precision-Recall.
Rule -
There is no specific rule to follow; you have to apply all the above methods individually or in combination (this is important: sometimes we apply both over- and under-sampling together, as in the sketch below) and see what works.
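A minimal sketch of combining over- and under-sampling, assuming imbalanced-learn and scikit-learn; the sampling ratios are illustrative, and cross-validation is what tells you whether the combination actually helps:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

pipe = Pipeline([
    ("over", SMOTE(sampling_strategy=0.3, random_state=0)),               # minority up to 30% of majority
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)), # then majority down to a 2:1 ratio
    ("clf", LogisticRegression(max_iter=1000)),
])

# With an imblearn Pipeline the resampling happens only on the training folds,
# so the test folds keep their original class distribution.
print(cross_val_score(pipe, X, y, scoring="average_precision", cv=5).mean())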
Also, keep in mind that imbalance is a separate problem which has nothing to do with the decision boundary. If the data has a very vague class decision boundary, that is a separate issue; in that case, even after applying the above methods, you might not get a good result.
Stratified k-fold
Again, this has nothing to do with imbalance. You should always use a stratified split. I will say it again: keep general ML issues/solutions separate from the imbalanced-data issue.
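A minimal sketch of a stratified split and stratified k-fold, assuming scikit-learn; stratification only preserves the class ratio in each fold, nothing more:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Stratified hold-out split: train and test keep the same class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

# Stratified k-fold: every fold keeps the same class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    pass  # fit on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]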
Good reads
Jeremy Jordan - Blog
Tom Fawcett - Blog
Haibo He and Edwardo A. Garcia - Paper
Imb-learn Library
Jason Brownlee - SMOTE
Learning from Imbalanced Data Sets - Book
Answered by 10xAI on January 9, 2021