How do I make inferences about test metrics for the entire population from sample metrics?

Question

Generally, we calculate specific metrics for ML models on a test set (and we try to make that test set representative). I'm not clear on how to make inferences about the same metrics for the population that the test set represents. That is, say I want to answer: if the model were run on the whole population, what is the confidence interval of the metric in question at (e.g.) a 95% confidence level?
Now for a simple case I can try to use my basic stats knowledge: suppose I have a binary classification model and I'm interested in reporting its precision.

1. I measure the precision on the test set and define the test statistic as the sample proportion $\hat p$ of correctly classified examples out of total examples.
2. I also run the model on different folds of data to get the precision on each fold, and then calculate the standard deviation of those per-fold precision values (call it $\bar\sigma$): this is my proxy for the standard deviation of the sampling distribution of the proportion.
3. Alternatively, I can measure the standard deviation for each fold $i$ as $\sigma_i=\sqrt{np_i(1-p_i)}$, where $p_i$ is the precision measured in that fold. I'm assuming a Binomial distribution with sequence size $n$ and probability of "success" (correct prediction) equal to the precision $p_i$. Then I take the average of all these $\sigma_i$ to get an estimate of the "population standard deviation" and divide that by $\sqrt{n}$. That is, if the number of folds I considered was $k$, then $\bar\sigma=\sum_{j=1}^k\sigma_j/(k\sqrt{n})$.
4. Using either of the methods in 2 or 3 to calculate $\bar\sigma$, we estimate the population precision as $\hat p\pm 1.96\bar\sigma$.
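To make the two fold-based estimates of $\bar\sigma$ concrete, here is a minimal Python sketch that simply transcribes the steps as I stated them; it is not meant to endorse either one. The function names, the fold precisions, $\hat p$ and $n$ are all hypothetical values chosen for illustration.

```python
import math
import statistics


def interval_from_fold_spread(p_hat, fold_precisions, z=1.96):
    """Method 2: sigma_bar is the sample std dev of the per-fold precisions."""
    sigma_bar = statistics.stdev(fold_precisions)
    return p_hat - z * sigma_bar, p_hat + z * sigma_bar


def interval_from_binomial_folds(p_hat, fold_precisions, n, z=1.96):
    """Method 3: sigma_i = sqrt(n * p_i * (1 - p_i)), averaged over the k folds,
    then divided by sqrt(n), exactly as written in the question."""
    sigmas = [math.sqrt(n * p * (1 - p)) for p in fold_precisions]
    sigma_bar = sum(sigmas) / (len(sigmas) * math.sqrt(n))
    return p_hat - z * sigma_bar, p_hat + z * sigma_bar


# Hypothetical inputs, not real results.
fold_precisions = [0.80, 0.82, 0.78, 0.81, 0.79]
p_hat = 0.80   # precision measured on the test set
n = 200        # assumed number of examples per fold

lo2, hi2 = interval_from_fold_spread(p_hat, fold_precisions)
lo3, hi3 = interval_from_binomial_folds(p_hat, fold_precisions, n)
print(f"method 2: ({lo2:.3f}, {hi2:.3f})")
print(f"method 3: ({lo3:.3f}, {hi3:.3f})")
```

Writing it out this way, I notice that in method 3 the factor $n$ cancels, since $\sqrt{np_i(1-p_i)}/\sqrt{n}=\sqrt{p_i(1-p_i)}$, so the resulting $\bar\sigma$ does not shrink as the folds get larger and the interval comes out much wider than the one from method 2.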

Or I could just calculate the interval as (assuming test set size $m$)
$$\hat p\pm t_{m,95\%}\cdot\sqrt{\frac{\hat p(1-\hat p)}{m}}$$
where $t_{m,95\%}$ is the t-distribution critical value corresponding to a 95% confidence level and sample size $m$.
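As a rough sketch of this single-test-set interval in Python: I am assuming here that $m$ is large enough for the t critical value to be approximated by the standard normal quantile (about 1.96 at 95%), which lets the example stay within the standard library via `statistics.NormalDist`. The function name and the inputs are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_interval(p_hat, m, confidence=0.95):
    """Interval p_hat +/- q * sqrt(p_hat * (1 - p_hat) / m).

    Assumption: q is the standard normal quantile, used as a stand-in
    for the t critical value, which it approaches for large m.
    """
    q = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = q * math.sqrt(p_hat * (1 - p_hat) / m)
    return p_hat - half_width, p_hat + half_width


# Hypothetical inputs: precision 0.80 measured on a test set of 1000 examples.
low, high = proportion_interval(0.80, 1000)
print(f"95% interval: ({low:.4f}, {high:.4f})")
```

For a small $m$, the normal quantile would need to be replaced by an actual t quantile, which the standard library does not provide.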
But what about other metrics, such as a precision-recall combination, mean absolute percentage error, mean absolute error, RMSE, etc.? Obviously I'm not expecting a recipe for each metric, just a general idea of how to go about getting interval estimates for arbitrary metrics. Also, does the methodology described above seem correct?
