Quantifying the uncertainty of aggregated model predictions

Cross Validated
Asked by kh_one on December 13, 2020

Say I have a binary response variable, Y, that I model using a logistic model with four predictors, A, B, C and D. To make matters concrete, imagine that Y = 1 designates a respondent registering support for something, and 0 an absence of support.

Having estimated the relevant parameters on some sample, S, I then want to see what proportion of 1s (i.e., support) I likely would have seen, had all observations in S taken on a particular value on A. Assume conditions for causal inference are satisfied for A, so that changing its value can be thought of as a (hypothetical) intervention.

So I create a "new" sample, S*, identical to S, save for each observation taking on the desired value on A. I then use the fitted model to "predict" the probability of Y = 1 for each observation in that sample. Taking the mean of those predictions, I get an estimated proportion of support under the relevant intervention.
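
For concreteness, a minimal R sketch of that procedure might look like the following (the names fit, S, S_star, and a_star are illustrative, not from the question):

# fit the logistic model on the original sample S
fit <- glm(Y ~ A + B + C + D, data = S, family = binomial)

# build S*: a copy of S with every observation set to the desired value of A
S_star <- S
S_star$A <- a_star

# predicted probability of Y = 1 for each observation under the intervention
p_hat <- predict(fit, newdata = S_star, type = "response")

# estimated proportion of support under the intervention
mean(p_hat)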

My question is: how should I quantify the uncertainty of that estimate? I can think of three ways, but am not sure which one (if any) makes sense:

  1. Resample from the predicted probabilities of the model and bootstrap a confidence interval for the relevant mean that way.
  2. Calculate a confidence interval for the prediction made on each observation (the probability that respondent 1 registers support, etc.) like here, and then create a confidence interval for mean support by taking the means of the upr and lwr values of the individual predictions.
  3. Resample from S with replacement to refit a large number of models, generate predictions from each refitted model on the corresponding intervened sample, and then bootstrap a confidence interval for the relevant mean from those predictions (a sketch of this approach follows the list).
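
A hedged sketch of option 3 in R, reading it as: resample rows of S with replacement, refit the model on each resample, apply the intervention to the resampled data, and take percentiles of the resulting mean predictions (S, Y, the predictors, and a_star are illustrative names, not from the question):

set.seed(42)
boot_means <- replicate(2000, {
  idx <- sample(nrow(S), replace = TRUE)   # resample observations from S
  boot_fit <- glm(Y ~ A + B + C + D, data = S[idx, ], family = binomial)
  boot_star <- S[idx, ]
  boot_star$A <- a_star                    # apply the hypothetical intervention
  mean(predict(boot_fit, newdata = boot_star, type = "response"))
})

# percentile bootstrap 95% confidence interval for the mean prediction
quantile(boot_means, probs = c(0.025, 0.975))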

Any advice here would be greatly appreciated.

One Answer

If I am understanding you correctly, you want to create a confidence interval around the proportion of observations that would have resulted in $Y = 1$ given $A = a$. Your logistic regression already provides $P(Y = 1 | A = a, B = b, C = c, D = d)$, and by the law of total probability you are justified in taking the mean of those predicted probabilities as a point estimate of $P(Y = 1 | A = a)$. Because the proportion of observations where $Y = 1$ is logically equivalent to the probability that a randomly drawn observation yields $Y = 1$, you could apply the normal approximation to the predicted probabilities themselves to create a confidence interval.

Here is some R code that would do the trick:

# fake data for demonstration:
# 50 predicted probabilities drawn from a uniform distribution between 0 and 1
predicted_probs <- runif(n = 50, min = 0, max = 1)

# estimate of the average probability
global_prob_est <- mean(predicted_probs)

# standard error of the estimate (sd divided by the square root of n)
global_prob_se <- sd(predicted_probs)/sqrt(length(predicted_probs))

# 90% CI using the 5th and 95th percentiles of a normal distribution
qnorm(p = c(0.05, 0.95), mean = global_prob_est, sd = global_prob_se)

Answered by David Telson on December 13, 2020
