Cross Validated Asked on November 29, 2021
I’m designing a logistic regression model to predict hospital mortality.
Why? To identify ‘adjusted’ odds ratios for a variable of interest on mortality.
Methods: the model was set up using a training dataset (75% of the total data).
When I run predictions on the test cohort (25%), I get the following model diagnostics:
Looking at the confusion matrix, the model always predicts the largest class, which gives a high accuracy but a very poor model overall.
How can I improve the model?
Possible solutions?
I am almost certain that your logistic regression does not predict only one outcome, i.e., a probability of $\hat{p}_i=0$ or $\hat{p}_i=1$ for the target class for all instances $i$. Rather, it predicts some $\hat{p}_i\in[0,1]$, which you then compare to a threshold $\theta$ that you chose in some way. Possibly, you use $\theta=0.5$. You then label instance $i$ as "target class" or "non-target class" based on $\hat{p}_i$ and $\theta$. And it happens that $\hat{p}_i\geq\theta$ for all $i$ (or, alternatively, $\hat{p}_i\leq\theta$ for all $i$), so every instance receives the same label.
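To make the thresholding step concrete, here is a minimal sketch in plain Python. The probabilities in `p_hat` are made up for illustration (they are not the asker's data), and `hard_labels` is a hypothetical helper, not part of any library. It only shows that predicted probabilities can differ across instances while the hard labels at $\theta=0.5$ all coincide.

```python
def hard_labels(probs, theta=0.5):
    """Label an instance 1 (target class) when its predicted probability is >= theta."""
    return [1 if p >= theta else 0 for p in probs]


# Hypothetical predicted mortality probabilities for a rare outcome
# (illustrative values only, not taken from the question).
p_hat = [0.02, 0.05, 0.11, 0.08, 0.31, 0.04, 0.22, 0.01]

# With theta = 0.5 every probability falls below the threshold,
# so all instances get the same label even though the p_hat values differ.
labels_default = hard_labels(p_hat, theta=0.5)

# A different (arbitrary) threshold yields different labels from the same model.
labels_lower = hard_labels(p_hat, theta=0.2)

print(labels_default)
print(labels_lower)
```

The model itself is unchanged between the two calls; only the threshold differs, so the "one class only" behaviour comes from the labelling rule rather than from the fitted probabilities.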
The solution to your conundrum is not to use a threshold and hard classification at all, but to deal directly with the probabilistic classification given by $\hat{p}$. More information can be found in the thread "Reduce Classification Probability Threshold". I also recommend "Why is accuracy not the best measure for assessing classification models?", because every criticism leveled there at accuracy applies equally to precision, recall, etc.
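One common way to assess probabilistic predictions directly is with proper scoring rules such as the Brier score or the log loss. The answer above does not name a specific metric, so treat this choice as an assumption. The sketch below uses made-up outcomes and probabilities and hypothetical helpers `brier_score` and `log_loss`. It compares a model's probabilities against a constant base-rate prediction; both would produce identical hard labels at a threshold of 0.5.

```python
import math


def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def log_loss(y_true, p_hat, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical outcomes (1 = died) and predicted probabilities, for illustration only.
y = [0, 0, 0, 0, 1, 0, 1, 0]
p_model = [0.02, 0.05, 0.11, 0.08, 0.31, 0.04, 0.22, 0.01]

# Baseline: predict the observed base rate for every instance.
p_base = [sum(y) / len(y)] * len(y)

brier_model = brier_score(y, p_model)
brier_base = brier_score(y, p_base)
logloss_model = log_loss(y, p_model)
logloss_base = log_loss(y, p_base)

# Lower is better for both scores.
print(brier_model, brier_base)
print(logloss_model, logloss_base)
```

At a 0.5 threshold both sets of predictions label every instance as non-target and therefore have the same accuracy, yet the scoring rules still distinguish them, because they reward assigning higher probabilities to the instances that actually had the outcome.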
Answered by Stephan Kolassa on November 29, 2021