Cross Validated · Asked on December 15, 2021
I understand the decision-theoretic fundamentals behind accuracy being an improper scoring rule, as opposed to proper scoring rules like the Brier score and log loss, and that threshold setting for binary outcomes is highly subjective.
To give some background, the following questions emerged from my previous question about setting thresholds for binary predictions of a fire occurring. There, I did not know the costs of false positives and false negatives in the case of fire and was therefore advised to use proper scoring rules. I understand that one should use proper scoring rules when one does not know the cost of misclassification. But in my head, using a proper scoring rule does not change the fact that there is still some probability of misclassifying a fire as no fire, and vice versa.
(1) So, how can one be sure/argue that 1's are more likely to be predicted as 1's and vice versa just because a proper scoring rule is applied rather than an improper one?
(2) How come the semi-proper scoring rule AUC is sometimes suggested as the evaluation metric, as here, and considered completely bogus at other times?
(3) Is the confusion matrix, and everything that comes with it, actually used mostly because it is comprehensible and easy to report to others?
I won't be able to answer all your questions, but here goes.
- So, how can one be sure/argue that 1's are more likely to be predicted as 1's and vice versa just because a proper scoring rule is applied rather than an improper one?
You can't be sure, but you can argue.
A scoring rule is a function $S$ that takes a probabilistic prediction or classification $\hat{f}$ and a corresponding actual observation $y$ and maps these to a loss value, $S(\hat{f},y)\in\mathbb{R}$.
Now, both $\hat{f}$ and $y$ are random. For $y$, this is obvious, and for $\hat{f}$, this is due to the fact that we typically sample predictors and corresponding noisy actuals and build our model based on this.
So it makes sense to consider the expectation of our scoring rule, $E\big(S(\hat{f},y)\big)$. Let's denote only the unknown distribution $f$ of $y$ in this expectation, for convenience: $E_{y\sim f}\big(S(\hat{f},y)\big)$.
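For concreteness, here is a minimal Python sketch (with hypothetical numbers, not from the question) of two proper scoring rules for binary outcomes, the Brier score and the log loss, written exactly in this form: a function of a predicted probability and an observed outcome, whose sample mean estimates the expectation $E_{y\sim f}\big(S(\hat{f},y)\big)$.

```python
import numpy as np

def brier_score(p_hat, y):
    """Brier score: squared error between predicted probability and 0/1 outcome."""
    return (p_hat - y) ** 2

def log_loss(p_hat, y):
    """Logarithmic score: negative log-likelihood of the observed outcome."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hypothetical fire probabilities and observed outcomes (1 = fire, 0 = no fire)
p_hat = np.array([0.9, 0.2, 0.7, 0.1])
y     = np.array([1,   0,   0,   0])

print(brier_score(p_hat, y).mean())  # sample estimate of E[S(f_hat, y)]
print(log_loss(p_hat, y).mean())
```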
A scoring rule is called proper if this expectation is minimized over all $\hat{f}$ by the true distribution $f$:
$$ E_{y\sim f}\big(S(f,y)\big) \leq E_{y\sim f}\big(S(\hat{f},y)\big). $$
(There is also the opposite convention, where scoring rules are positively oriented and maximized in this situation. We will stick with the minimization convention here.)
Thus, if we have two competing probabilistic predictions $\hat{f}$ and $\hat{g}$, and one of them is the true distribution $f$, we expect the scoring rule to give us a lower (or at least not-higher) value for this one compared to the other.
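As a small numerical illustration of this inequality (my own sketch, with an assumed true probability, not part of the original answer): for a Bernoulli outcome with true probability $p = 0.3$, the expected Brier score $p(q-1)^2 + (1-p)q^2$ as a function of the quoted probability $q$ is minimized exactly at $q = p$.

```python
import numpy as np

p = 0.3                           # assumed true probability of the event
q = np.linspace(0.01, 0.99, 99)   # candidate quoted probabilities

# Expected Brier score under the true distribution:
# E[(q - y)^2] = p*(q - 1)^2 + (1 - p)*q^2
expected_brier = p * (q - 1) ** 2 + (1 - p) * q ** 2

print(q[np.argmin(expected_brier)])  # ~0.30: the true probability minimizes the expected score
```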
And the arguing you asked about happens when we flip this around: one prediction $\hat{f}$ gives us a lower score than another one $\hat{g}$, so it stands to reason that $\hat{f}$ is "closer" to the true $f$ than $\hat{g}$. But of course, since we are only talking about expectations, it may well be that for our particular sample, a wrong prediction gave us a lower score than the true distribution.
(Also, I'll admit that we are committing a similar error in flipping implications as when people misinterpret $p$ values as probabilities for hypotheses.)
And if we do the exercise with an improper scoring rule, the problem is simply that this improper rule has no reason to be minimized by the true distribution; if it did, it would not be improper anymore, but proper.
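To make that concrete with the same hypothetical $p = 0.3$ (again my own sketch): treat accuracy as a scoring rule by thresholding the quoted probability at 0.5 and scoring the resulting hard classification with 0-1 loss. Every $q$ below 0.5, whether it is 0.3 or 0.01, achieves the same expected loss, so this rule has no way of singling out the true probability.

```python
import numpy as np

p = 0.3                           # assumed true probability of the event
q = np.linspace(0.01, 0.99, 99)   # candidate quoted probabilities

# "Accuracy as a scoring rule": 0-1 loss of the hard classification 1{q >= 0.5}
hard_class = (q >= 0.5).astype(float)
expected_01_loss = p * (1 - hard_class) + (1 - p) * hard_class

# Many different q values attain the minimum expected loss, not just q = p
print(q[expected_01_loss == expected_01_loss.min()])
```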
- How come the semi-proper scoring rule AUC is sometimes suggested as the evaluation metric, as here, and considered completely bogus at other times?
I'll be honest: I don't have a handle on this. It might be a good separate question.
- Is the confusion matrix, and everything that comes with it, actually used mostly because it is comprehensible and easy to report to others?
Well... People think they understand it. Just as they think they understand accuracy. Easily "understood" falsehoods often have an advantage over harder-to-understand truths.
(From your comment):
We will still get some misclassified fires and non-fires in my case, even when using proper scoring rules.
Yes, certainly. Proper scoring rules are not magic silver bullets that will give you perfect predictions. After all, they evaluate probabilistic predictions. If your prediction is 80% for class A, and this is the correct probability, then there is still a 20% chance for non-A.
Proper scoring rules have the advantage that they work in expectation. As above, they may not give you the best result in each and every instance. But they will work better than alternatives in the long run.
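Here is a rough simulation sketch of that long-run behavior (hypothetical numbers and a deliberately miscalibrated competitor, purely for illustration): averaged over many observations, the true probabilities get a lower mean Brier score than an overconfident forecast, even though on any single observation either one might come out ahead.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

p_true = rng.uniform(0.05, 0.95, size=n)   # true event probabilities
y = rng.binomial(1, p_true)                # observed 0/1 outcomes

# A miscalibrated competitor that pushes forecasts toward 0 or 1
p_overconfident = np.clip(1.5 * p_true - 0.25, 0.01, 0.99)

brier = lambda p, y: (p - y) ** 2
print(brier(p_true, y).mean())           # lower mean score in the long run
print(brier(p_overconfident, y).mean())  # higher mean score
```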
Finally, if you get bad predictions even with a proper scoring rule, then of course you need to revisit your model. Was there some predictor you didn't include, because you simply didn't know it? Very bad (probabilistic) predictions can be a source of a lot of learning.
Answered by Stephan Kolassa on December 15, 2021