Data Science Asked on September 5, 2021
I am working with a heavily imbalanced dataset ($\approx$ 1% positive cases) for a classification problem. I know that class balancing is an important step in this scenario.
I have two questions:
Considering that I don't want to assign 0/1 labels, but only to order the records according to the output score (which is always a calibrated probability of being in the positive class), is class balancing still a good idea, or is it useless given this specific output?
Basically, I do not care about the cut-off point; I just want to sort the records to identify the ones with the highest probability of being positive.
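To make the ranking use case concrete, here is a minimal sketch of sorting records by predicted probability without ever applying a threshold. It assumes scikit-learn; `make_classification` is only a stand-in for the real imbalanced data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative data with roughly 1% positives
X, y = make_classification(n_samples=2000, weights=[0.99, 0.01], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.predict_proba(X)[:, 1]    # probability of the positive class
ranking = np.argsort(scores)[::-1]     # indices, most-likely-positive first
top_100 = ranking[:100]                # records to inspect first
```

No 0/1 prediction is made anywhere; only the ordering of `scores` matters, which is why metrics like ROC AUC (threshold-free) are the natural way to evaluate this setup.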
Given the very small percentage of positive cases, is it better to do oversampling or undersampling? Is there any rule of thumb for choosing the resampling proportion?
Thank you in advance!
Some Python sklearn models have the option class_weight="balanced". With it, you tell the algorithm that your data are unbalanced, and it makes the adjustment by itself. You can try this on a few models; I had better results with this option than with downsampling the majority class on a similar problem.
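A minimal sketch of what class_weight="balanced" does in scikit-learn (the dataset here is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Illustrative data with roughly 1% positives
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

# "balanced" sets each class weight to n_samples / (n_classes * class_count),
# so the rare positive class gets a much larger weight in the loss.
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

Because the reweighting happens inside the loss function, no rows are duplicated or discarded, which is a practical advantage over resampling when the dataset is small.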
Answered by BeamsAdept on September 5, 2021
Referring to a previous answer and a blog post (which I'm aware is not that relevant, since its data is more balanced than yours), I think your first approach should be without handling the imbalance; if you're happy with the results, there is no need to work towards balanced solutions.
As with many ML topics, the best way to find out is to try: I recommend adapting the experiment in the blog post to your data.
However, a more specific answer to your question:
Answered by David Masip on September 5, 2021
With such a heavy imbalance and (it seems) two classes, you could treat this as more of an outlier detection problem. It is worth reading up on models and algorithms in that direction!
If you go forward with traditional classification, you need to balance the dataset; consider methods such as SMOTE.
Depending on the size of your data, I would generally recommend downsampling the majority class, which avoids producing "synthetic" cases; advanced methods such as SMOTE basically make this decision for you.
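A minimal sketch of downsampling the majority class with scikit-learn's `resample` (the data and the 10:1 target ratio are illustrative assumptions; the right ratio is a judgment call):

```python
import numpy as np
from sklearn.utils import resample

# Illustrative data: 990 negatives, 10 positives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 990 + [1] * 10)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Downsample the majority class to 10x the minority count, without replacement
X_maj_down, y_maj_down = resample(
    X_maj, y_maj, replace=False, n_samples=10 * len(y_min), random_state=0
)

X_bal = np.vstack([X_maj_down, X_min])
y_bal = np.concatenate([y_maj_down, y_min])
```

Unlike SMOTE, every row in `X_bal` is a real observation; the cost is that most of the majority-class data is thrown away, which is why downsampling suits large datasets best.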
Could you also elaborate on what you mean in your first question? A classification algorithm needs 0/1 labels, so its output cannot be ordered in the way you mention. Some classification algorithms output a probability score instead of a predicted label; if that is what you mean, I can tell you that the imbalance will still be a problem.
Answered by Fnguyen on September 5, 2021