Data Science Asked by Sandy Lee on July 7, 2021
I have a logistic regression classifier whose precision and recall differ at different probability thresholds, as follows:
The classifier's default threshold for deciding which class an example belongs to is 0.5. However, am I right in understanding that, to get the best precision-recall trade-off, I should set the decision threshold to about 0.82, as shown below? That can be done in Scikit-Learn, but I want to make sure that I am drawing the correct conclusions. Any advice would be appreciated.
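For reference, here is a minimal sketch of how a custom threshold can be applied to a fitted scikit-learn `LogisticRegression`. The dataset, the `clf`/`X_test` names, and the 0.82 cut-off are illustrative assumptions, not the asker's actual setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for the asker's problem (assumption)
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns probabilities for both classes; take the positive column
proba_pos = clf.predict_proba(X_test)[:, 1]

# clf.predict() uses an implicit 0.5 cut-off; apply a custom one instead
threshold = 0.82  # hypothetical value taken from the question
y_pred_custom = (proba_pos >= threshold).astype(int)
```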
The intersection of the precision and recall curves is certainly a good choice, but it's not the only one possible.
The choice depends primarily on the application: in some applications very high recall is crucial (e.g. a fire alarm system), whereas in others precision matters more (e.g. deciding whether somebody needs a risky medical treatment). Naturally, if your application needs high recall you'd choose a threshold below 0.6; if it needs high precision, you'd choose one around 0.85-0.9.
If neither case applies, people usually choose an evaluation metric to optimize: F1-score is a common one, sometimes accuracy (but don't use accuracy if there is strong class imbalance). The F1-score is likely to be optimal around the point where the two curves intersect, but that's not guaranteed: for example, it might be a bit before 0.8, where recall decreases slowly while precision increases quickly (this is just an example, I'm not sure of course).
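To make the metric-optimization idea concrete, here is a minimal sketch of picking the F1-maximizing threshold from a precision-recall curve. It assumes `y_test` and `proba_pos` (test labels and positive-class probabilities) as in the snippet above; these names are illustrative, not from the original post:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# One (precision, recall) pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, proba_pos)

# precision and recall have one more entry than thresholds; drop the last point
# so the arrays align, and add a tiny epsilon to avoid division by zero
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best_idx = np.argmax(f1)
print(f"Best F1 = {f1[best_idx]:.3f} at threshold = {thresholds[best_idx]:.3f}")
```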
My point is that even though it's a perfectly reasonable choice in this case, in general there's no particular reason to automatically choose the point where precision and recall are equal.
Correct answer by Erwan on July 7, 2021