TransWikia.com

Follow up question regarding Upsampling for Imbalanced Data and the use of ADASYN instead of SMOTE

Data Science Asked by Ammar Kamran on March 24, 2021

I have a follow-up question regarding this topic.

I have been working on a project that predicts success (1) or failure (0) for organizations using the Decision Tree and Random Forest algorithms.

My dataset has a minority class of successes which I would like to upsample using SMOTE or ADASYN.

I understand that the reasoning in that post applies to SMOTE and to random upsampling by duplication, but does it also apply to upsampling with ADASYN? As I understand it, ADASYN introduces even more randomness into the synthetic observations, so perhaps the correlation might be lower? In other words, does using ADASYN justify upsampling before the split, or even upsampling the training and testing data separately?

I have seen a research paper that first applied the train/test split and then upsampled the minority class with ADASYN in the training and testing datasets separately. This approach made more sense to me: upsampling before the split introduces the possibility of leakage from the training data into the test data, whereas upsampling the two datasets separately removes that possibility. However, I have heard that this approach is also not fully correct, since the testing dataset is supposed to replicate the real world and hence should not be changed in any way.

On the other hand, instead of upsampling the minority class in the test dataset, I could downsample the majority class. This might be a better approach, since the testing dataset would still contain only real-world observations. It would simply give the algorithm a fair chance (50:50) of picking either class (1 or 0). Once again, though, the real world most likely will not have the 1s and 0s in equal proportions.
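The downsampling idea can be sketched with the standard library alone. The function name downsample_majority and the toy label list are hypothetical, chosen only for illustration; the sketch keeps every minority row and a random, equally sized subset of the majority rows.

```python
import random


def downsample_majority(labels, seed=0):
    """Return sorted row indices giving every class the minority-class count.

    All rows of the smallest class are kept; larger classes are randomly
    subsampled without replacement down to that size.
    """
    rng = random.Random(seed)
    idx_by_class = {}
    for i, label in enumerate(labels):
        idx_by_class.setdefault(label, []).append(i)
    n_min = min(len(indices) for indices in idx_by_class.values())
    kept = []
    for indices in idx_by_class.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Toy test labels (assumption): 90 failures (0) and 10 successes (1).
y_test = [0] * 90 + [1] * 10
kept = downsample_majority(y_test)
balanced = [y_test[i] for i in kept]
print(len(kept), balanced.count(0), balanced.count(1))
```

Every retained row is a real observation, but the 50:50 ratio no longer matches the original class frequencies, so metrics that depend on prevalence (precision, for example) would be measured under a different class mix than the one in the unmodified data.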

Additionally, some sources suggest doing the train/test split proportionally, so that the training and testing datasets contain the minority and majority classes in the same proportions as each other. As I understand it, this can be done by passing stratify=y to the split function. Please let me know whether I need to do this, and why.
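For reference, this is a small sketch of what I mean by stratify=y, using scikit-learn's train_test_split. The toy X and y are made up for illustration and are not my data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 rows, 10% of them in the minority class (1).
y = [0] * 90 + [1] * 10
X = [[i] for i in range(100)]

# stratify=y asks for the class ratio of y to be preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
train_minority_share = train_counts[1] / len(y_train)
test_minority_share = test_counts[1] / len(y_test)
print(train_minority_share, test_minority_share)
```

Note that stratification keeps the original imbalance in both parts; it does not make the classes 50:50 within either part.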
