Data Science Asked by Henrique Nader on February 5, 2021
Training and testing data have around 1% positives, but the model predicts only around 0.1% as positives.
The model is an xgboost classifier.
I’ve tried calibration but it didn’t improve much. I also don’t want to pick thresholds since the final goal is to output probabilities.
What I want is for the model's predicted positive rate to be close to the actual positive rate in the data.
The first (and easiest) option is to make sure that your model is calibrated in probabilities. With XGBoost in Python, this means passing objective="binary:logistic" when constructing the classifier, so that the model optimises a logistic loss and predict_proba returns probabilities.
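A minimal sketch of that first option, assuming the usual xgboost/scikit-learn Python APIs (make_classification here is just a stand-in for your own ~1%-positive data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the ~1%-positive data described in the question
X, y = make_classification(n_samples=20000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# binary:logistic optimises a logistic loss, so predict_proba returns probabilities
model = XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # P(y = 1) for each row
```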
The alternative is to transform the output of your model into probabilities. There are different approaches for that.
This could be achieved with some sort of regression technique to find the relationship between probabilities and your output. scikit-learn's isotonic regression should work for that purpose. However, without more information on your score distribution, it is possible it won't work well.
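One way to do this in scikit-learn is CalibratedClassifierCV with method="isotonic", which fits an isotonic (monotone, non-parametric) map from raw scores to probabilities on held-out folds. A sketch, reusing model, X_train and y_train from the snippet above:

```python
from sklearn.calibration import CalibratedClassifierCV

# Learns a monotone map from the classifier's scores to probabilities,
# using cross-validation so the map is fitted on out-of-fold predictions
calibrated = CalibratedClassifierCV(model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
calibrated_proba = calibrated.predict_proba(X_test)[:, 1]
```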
This can also be achieved with Platt scaling: fitting a logistic regression (a sigmoid curve) on your model's raw scores against the true labels. It is relatively easy to do, but in my experience doesn't necessarily work well on unbalanced problems with non-linear relationships.
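The same scikit-learn wrapper implements Platt scaling via method="sigmoid"; a sketch under the same assumptions as above:

```python
from sklearn.calibration import CalibratedClassifierCV

# method="sigmoid" fits a logistic curve to the raw scores (Platt scaling)
platt = CalibratedClassifierCV(model, method="sigmoid", cv=5)
platt.fit(X_train, y_train)
platt_proba = platt.predict_proba(X_test)[:, 1]
```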
Finally, there are some approaches that simply correct the output depending on your model. For logistic regression, that would mean shifting your bias (intercept) term so that the overall predicted proportion matches the one of your dataset. This can also be used to counter the effects of rare events (see this). I have found this to work quite well with logistic regression. However, I am not sure whether it is directly applicable to XGBoost, but it could be worth a try.
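A sketch of that intercept correction for logistic regression (the names below are illustrative; note the shift is mostly useful when the model was trained on resampled or reweighted data, or otherwise under-predicts the positive rate — a plain maximum-likelihood fit on the original data will already give delta close to zero):

```python
import numpy as np
from scipy.optimize import brentq
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
logits = clf.decision_function(X_train)  # log-odds before any correction
target_rate = y_train.mean()             # positive proportion we want to match

def mean_prob(delta):
    # average predicted probability after shifting every log-odds by delta
    return (1.0 / (1.0 + np.exp(-(logits + delta)))).mean()

# Find the constant shift that makes the average prediction equal the base rate
delta = brentq(lambda d: mean_prob(d) - target_rate, -10.0, 10.0)
clf.intercept_ += delta  # bake the correction into the model's bias term
```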
Answered by lcrmorin on February 5, 2021