Cross Validated: Asked by pankaj negi on November 2, 2021
I am building a binary classification model where the proportion of 1s is only 3% out of 70,000 data points in total. I have 5 variables, of which 3 are coming out as important.
I have built models using logistic regression and GBM. Under cross-validation my model's log loss is 0.11. However, when I plot the predicted probabilities, I see that they are clustered at the extreme ends with almost no cases in between. The mean probability is 0.08 and the median is 0.01.
I am building a scoring model, so I am interested in the probabilities given by the model. Any ideas on why this could be happening?
Well, firstly, it could be a good thing. If it is easy to predict for every case which class it belongs to, then you would see this kind of behaviour. The main problem in that scenario - as mentioned in the answer by cdalitz - is that you might run into perfect separation, which is especially challenging for logistic regression fit using maximum likelihood (possible approaches to deal with it: exact logistic regression, Firth's correction, Bayesian logistic regression, elastic-net/LASSO/ridge logistic regression, etc.). Depending on how the outcomes are distributed across the predictor variables, this may or may not be occurring here. One possible hint is extreme coefficients (e.g. really huge values like >10 or <-10 on the logit scale) together with huge standard errors in the model output (some implementations have good diagnostics to warn you, others do not; the term to read up on is "[complete] separation").
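To make that concrete, here is a minimal sketch in R with toy data; the toy data and the choice of the logistf package for Firth's correction are mine, not from the question:

    # Toy example of complete separation: x below zero is always class 0,
    # x above zero is always class 1.
    df <- data.frame(
      x = c(seq(-2, -0.5, length.out = 50), seq(0.5, 2, length.out = 50)),
      y = rep(c(0, 1), each = 50)
    )

    fit_mle <- glm(y ~ x, family = binomial, data = df)
    # glm() warns that fitted probabilities numerically 0 or 1 occurred;
    # the slope estimate and its standard error blow up under separation.
    summary(fit_mle)$coefficients

    # Firth's correction (here via the 'logistf' package) keeps the
    # estimates finite.
    library(logistf)
    fit_firth <- logistf(y ~ x, data = df)
    coef(fit_firth)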
Secondly, it could be a bad thing in terms of overfitting (especially if there are few records relative to the number of predictors), where (nearly) perfect separation of the classes by the predictors occurs, but really only by chance due to the small sample size. This will then not generalize well to new data you want to predict. The same regularization techniques mentioned above can help for logistic regression, while picking suitable hyperparameters (e.g. through cross-validation) can help for boosting models.
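As one concrete option, here is a sketch of cross-validated ridge/LASSO logistic regression with the glmnet package; x and y below stand in for your predictor matrix and 0/1 outcome, which I obviously don't have:

    # Cross-validated penalized logistic regression with 'glmnet'.
    # Assumes x is a numeric predictor matrix and y the 0/1 outcome vector.
    library(glmnet)

    cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)  # alpha = 0: ridge
    cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1: LASSO

    # Predicted probabilities at the cross-validation-chosen penalty:
    p_hat <- predict(cv_ridge, newx = x, s = "lambda.min", type = "response")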
Thirdly, especially for boosting (and some other models; this also happens with neural networks, for example), it is well known that predicted probabilities tend to cluster inappropriately towards the extremes (the topic to search for is "calibration", or in this case potentially the lack thereof). In contrast, this tends to be less of a problem with "normal" (or ridge/elastic-net/LASSO) logistic regression. There are a number of possible fixes, such as isotonic or Platt scaling of the predicted probabilities, and loss functions that alleviate the problem (e.g. I recently saw focal loss proposed for this purpose).
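A rough sketch of both recalibration approaches in base R; p_cal, y_cal and p_new are placeholder names for a held-out calibration set (model probabilities plus true labels) and the new scores to recalibrate:

    # Keep qlogis() finite at probabilities of exactly 0 or 1.
    clip <- function(p) pmin(pmax(p, 1e-6), 1 - 1e-6)

    # Platt scaling: logistic regression of the outcome on the predicted logit.
    d_cal   <- data.frame(y = y_cal, z = qlogis(clip(p_cal)))
    platt   <- glm(y ~ z, family = binomial, data = d_cal)
    p_platt <- predict(platt, newdata = data.frame(z = qlogis(clip(p_new))),
                       type = "response")

    # Isotonic regression: monotone, nonparametric map from score to probability.
    ord   <- order(p_cal)
    iso   <- isoreg(p_cal[ord], y_cal[ord])
    p_iso <- approx(iso$x, iso$yf, xout = p_new, rule = 2, ties = mean)$y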
[Added] Final possibility: if the predictions are made on the same data the model was trained on, they will naturally be overfit unless the training data is very large (it gets worse with class imbalance and with some pretty strong and/or imbalanced predictors). This is less of a problem for out-of-fold predictions in cross-validation, which are typically less overfit, apart from the overfitting that occurs due to hyperparameter tuning via the same cross-validation.
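If you want to check this, here is a minimal sketch of computing out-of-fold predictions by hand, using a plain logistic regression; df and its outcome column y are placeholder names for your data:

    k     <- 5
    fold  <- sample(rep(1:k, length.out = nrow(df)))
    p_oof <- numeric(nrow(df))

    for (i in 1:k) {
      # Fit on all folds except i, predict on fold i only.
      fit <- glm(y ~ ., family = binomial, data = df[fold != i, ])
      p_oof[fold == i] <- predict(fit, newdata = df[fold == i, ], type = "response")
    }
    # Compare hist(p_oof) with the in-sample predicted probabilities to see
    # how much of the clustering at the extremes is due to overfitting.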
Answered by Björn on November 2, 2021
Maximum likelihood estimation of the parameters of a logistic regression is ill-defined if the classes are linearly separable: the parameter estimates diverge to +/- infinity. There are workarounds that introduce a regularization term, but you could first try linear discriminant analysis (R function lda), because it may already yield perfect results.
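A minimal sketch of that suggestion with MASS::lda; df and its 0/1 class column y are placeholder names for your data:

    library(MASS)

    # LDA needs a grouping factor as the response.
    fit  <- lda(factor(y) ~ ., data = df)
    pred <- predict(fit, df)
    head(pred$posterior)   # class posterior probabilities
    head(pred$class)       # predicted class labels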
Answered by cdalitz on November 2, 2021