Data Science Asked on November 26, 2020
Based on a previous post, I understand that when training a binary classification model on an imbalanced dataset, the validation folds used during CV should keep the same imbalanced distribution as the original dataset. My question is about the best training scheme.
Let’s assume that I have an imbalanced dataset with 5M samples, where 90% belong to the pos class and 10% to the neg class, and that I am going to use 5-fold CV for model tuning. Let’s also assume that I will hold out a random 100K samples for testing (90K pos-class samples vs 10K neg-class samples). Now I have two options:
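As a rough illustration of the setup described above, here is a minimal standard-library sketch of a stratified hold-out followed by stratified 5-fold assignment. The helper names (`group_by_class`, `stratified_holdout`, `stratified_folds`) are hypothetical, and the data is a scaled-down stand-in (10K labels rather than 5M) with the same 90/10 split and the same 2% hold-out rate (100K of 5M).

```python
import random
from collections import Counter, defaultdict


def group_by_class(indices, labels):
    """Map each class label to the list of sample indices carrying it."""
    groups = defaultdict(list)
    for idx in indices:
        groups[labels[idx]].append(idx)
    return groups


def stratified_holdout(indices, labels, test_fraction, rng):
    """Split indices into (train, test), sampling every class at the same rate."""
    train, test = [], []
    for members in group_by_class(indices, labels).values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def stratified_folds(indices, labels, n_folds, rng):
    """Deal each class round-robin into n_folds so each fold keeps the class ratio."""
    folds = [[] for _ in range(n_folds)]
    for members in group_by_class(indices, labels).values():
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)
    return folds


# Assumed toy data: 1 = pos class (90%), 0 = neg class (10%).
rng = random.Random(0)
labels = [1] * 9000 + [0] * 1000
all_idx = list(range(len(labels)))

# Hold out 2% for testing, then split the remainder into 5 validation folds.
train_idx, test_idx = stratified_holdout(all_idx, labels, 0.02, rng)
folds = stratified_folds(train_idx, labels, 5, rng)

for k, fold in enumerate(folds):
    counts = Counter(labels[i] for i in fold)
    print(f"fold {k}: {counts[1]} pos, {counts[0]} neg")
```

In practice, scikit-learn's `train_test_split(..., stratify=y)` and `StratifiedKFold` cover the same ground; the hand-rolled version is only meant to make the mechanics visible.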
Option 1)
However, given that I have enough data, I want to avoid using any data balancing algorithm for the training folds.
Option 2)
I am also aware that there is a 3rd option, based on the 1st option above, in which the model is trained directly on the imbalanced dataset, so a data balancing algorithm can be avoided.
My questions are:
I'm not sure if there's a question here, but I'll add some comments.
Firstly, if balanced data is available in the wild, always work with it. However, if you are going to manually create a "balanced" data set yourself, make sure that the selection criteria you use to create it are appropriate. As an example, choosing the 100k most recent positive and negative outcomes may not be appropriate because the time frame of the positive outcomes may extend well beyond that of the more common negative outcomes. So in this 200k data set, your 100k negative outcomes may come from the last year, while your 100k positive outcomes may come from the last ten years.
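The time-frame mismatch in that example is easy to check for. The sketch below uses made-up data (ten negative outcomes and one positive outcome per day over 1,000 days, so the negative class is the common one, as in the example) and a hypothetical helper `most_recent_per_class`; it is only an illustration of the check, not a recipe for building the sample.

```python
from datetime import date, timedelta


def most_recent_per_class(records, n_per_class):
    """Keep the n most recent (day, label) records of each class."""
    selected = {}
    for label in {label for _, label in records}:
        rows = sorted(
            (r for r in records if r[1] == label),
            key=lambda r: r[0],
            reverse=True,
        )
        selected[label] = rows[:n_per_class]
    return selected


def span_in_days(rows):
    """Number of calendar days covered by the selected records."""
    days = [day for day, _ in rows]
    return (max(days) - min(days)).days + 1


# Assumed synthetic history: 10 negatives and 1 positive per day for 1,000 days.
start = date(2018, 1, 1)
records = []
for offset in range(1000):
    day = start + timedelta(days=offset)
    records.append((day, "positive"))
    records.extend((day, "negative") for _ in range(10))

balanced = most_recent_per_class(records, n_per_class=100)
spans = {label: span_in_days(rows) for label, rows in balanced.items()}
print(spans)
```

Here the 100 negative records cover 10 days while the 100 positive records cover 100 days, so the two halves of the "balanced" sample describe different periods.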
Secondly, if you are going to balance your data, be aware of how the balancing technique works and try to understand its weaknesses and limitations. Be mindful that rebalancing a data set results in a new data set, and remember that you will have to check that the new data set is still appropriate to use. As an example, you will need to check that the distribution of input variables is still roughly the same as before. Applying this thinking to your options above, can you be certain that the data in each fold will be roughly similar?
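One possible way to run that check is to compare each input variable before and after rebalancing with a two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). The sketch below is a plain-Python version on assumed synthetic data for a single feature of the majority class; `ks_statistic` is a hypothetical helper, and the choice of KS is mine rather than something the answer prescribes.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Assumed data: one numeric feature for 9,000 majority-class rows.
rng = random.Random(0)
majority_feature = [rng.gauss(0.0, 1.0) for _ in range(9000)]

# Random undersampling to 1,000 rows versus a biased pick of the 1,000 largest values.
random_subset = rng.sample(majority_feature, 1000)
biased_subset = sorted(majority_feature)[-1000:]

ks_random = ks_statistic(majority_feature, random_subset)
ks_biased = ks_statistic(majority_feature, biased_subset)
print(f"random undersampling: KS = {ks_random:.3f}")
print(f"biased selection:     KS = {ks_biased:.3f}")
```

A value near 0 suggests the rebalanced feature still looks like the original, while a value near 1 means the selection has shifted the distribution. With SciPy available, `scipy.stats.ks_2samp` would also give a p-value.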
Lastly, if you are going to use a modelling framework which can handle imbalanced data, make sure you understand why it can handle the imbalanced data. In particular, if the framework applies some weighting or balancing technique in the background, you should be aware of this and be able to explain it.
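As one example of what such background weighting can look like, the sketch below computes inverse-frequency class weights, `n_samples / (n_classes * class_count)`. To my knowledge this is the heuristic scikit-learn documents for `class_weight="balanced"`; other frameworks may weight differently, so treat it as an illustration only. The helper name `balanced_class_weights` and the 90/10 toy labels are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Assumed toy labels with the question's 90/10 split: 1 = pos class, 0 = neg class.
labels = [1] * 9000 + [0] * 1000
weights = balanced_class_weights(labels)

# With these weights, each class contributes the same total weight to the loss.
total_weight = {
    label: weights[label] * count for label, count in Counter(labels).items()
}
print(weights)
print(total_weight)
```

The rare class gets a weight of 5.0 and the common class about 0.56, so a misclassified neg-class sample costs roughly nine times as much as a misclassified pos-class sample. That is the kind of behaviour worth being able to explain before relying on it.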
Answered by bradS on November 26, 2020