
How to incorporate the uncertainty of the model coefficients in the prediction interval of a multiple linear regression

Physics Asked by DannyVanpoucke on November 7, 2020

I'm dealing with the modeling of small experimental physics data sets (specifically the stickiness of glue compounds). As most experimental work does not generate thousands of samples, but rather a handful, I need to be inventive in how to deal with this small number of data sets (say 10-20). At this point I have a model framework (a regression, see the PPS below) which can deal with this rather well.

However, to get a better picture of the accuracy of my predictions, I want an error bar on my predicted values, so I can check how well my predictions predict new experiments. As this work is numerical in nature, the error bar originates from the underlying theoretical model, and the question is how these errors propagate (i.e., error analysis as one is used to in experimental physics).

For the sake of simplicity, assume that I am dealing with a multiple linear regression model, say (in reality there will be many many more terms):
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \tag{1}
$$

What I am looking for is an algebraic way of calculating (numerically) error bars (in actuality it is the prediction interval (PI) or confidence interval (CI), as both are related). In the statistics literature there are references to such a problem, and examples of how the PI and CI can be calculated. However, these only consider the variability of the $x$'s. The PI and CI are then related to (cf. question 147242):
$$
\hat{V}_f = s^2 \cdot \mathbf{x_0} \cdot \mathbf{(X^TX)^{-1}} \cdot \mathbf{x_0^T} + s^2 \tag{2}
$$
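
For concreteness, here is a minimal numerical sketch of how Eq. (2) is usually evaluated for an ordinary least-squares fit. The data, the design matrix and the new point $\mathbf{x_0}$ are made up purely for illustration:

```python
import numpy as np

# Made-up small data set: intercept plus two predictors, n = 12 samples.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(12), rng.uniform(0, 1, 12), rng.uniform(0, 1, 12)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.1, 12)

# Ordinary least-squares fit and residual variance s^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
n, p = X.shape
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

# Eq. (2): V_f = s^2 * x0 (X^T X)^{-1} x0^T + s^2  at a new point x0.
x0 = np.array([1.0, 0.3, 0.7])
V_f = s2 * x0 @ np.linalg.inv(X.T @ X) @ x0 + s2

y0_hat = x0 @ beta_hat
print(f"y0 = {y0_hat:.3f}, approximate 95% PI half-width = {1.96 * np.sqrt(V_f):.3f}")
```

Strictly, the half-width should use the $t$-quantile with $n-p$ degrees of freedom rather than 1.96; for small $n$ that difference matters.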

In contrast to these, each of my model coefficients (see the PPS below) ($\beta_0$, $\beta_1$ and $\beta_2$) has an error bar in this case (extracted via bootstrapping from a distribution, with the distributions being numerical in nature, not analytic, and specific to each of the three coefficients).
Is there a way to incorporate the uncertainty of the $\beta_i$'s (i.e., the "error bars") in the calculation of the PI (and CI)?

To put it very simply: how can the equation
$$
\hat{V}_f = s^2 \cdot \mathbf{x_0} \cdot \mathbf{(X^TX)^{-1}} \cdot \mathbf{x_0^T} + s^2 \tag{3}
$$

be modified to also incorporate the fact that the coefficients themselves are means of distributions?

(PS: One could create an ensemble of model instances with the $\beta_i$ drawn from their respective distributions, and calculate the CI of $y_0$ from the resulting distribution of $y_0$ values, but this is not really computationally efficient and brings a lot of other issues which I would like to avoid.)

(PPS:
The regression model presented is not the result of a direct regression on a single data set; instead, it is constructed as follows:

  1. Create an ensemble of N data sets.
  2. On each data set a regression gives rise to a linear model as indicated in the post above. This gives rise to N values for each of the coefficients $\beta_i$.
  3. The mean of each of the three sets is calculated.
  4. These three mean coefficients are the coefficients of the model presented above.
  5. The goal here: find the prediction interval for the averaged model above, taking into account the fact that the coefficients $\beta_i$ are calculated from numerical distributions; a minimal code sketch of this construction is given right after this list.)
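
A minimal sketch of steps 1-4, assuming the $N$ data sets are already available as a list of `(X, y)` pairs; the function names are illustrative and not from any particular library:

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least-squares coefficients for a single data set."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def averaged_model(datasets):
    """Fit each data set separately (step 2), then average the coefficients (steps 3-4).

    Returns the mean coefficient vector (the model used for prediction) and the
    full N x p matrix of per-set coefficients, which carries the numerical
    distribution of every beta_i.
    """
    B = np.vstack([fit_ols(X, y) for X, y in datasets])   # shape (N, p)
    return B.mean(axis=0), B
```

The matrix `B` returned here is the numerical distribution of the coefficients referred to in step 5; it is also exactly what the bootstrap answer below operates on.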

2 Answers

I don't totally understand the post you linked; it seems they are implicitly assuming they have a model for how $\vec{x}_0$ is generated, which is not true in the generic case... However, if I understand your question, the most generic and simplest solution to achieve what you want is bootstrapping your prediction intervals. The basic idea is to use each of your $N$ sets of data to produce a vector $\vec{\beta}$, then stack your $\vec{\beta}$ into a matrix

$$B = \begin{bmatrix}\vec{\beta}_1 \\ \vec{\beta}_2 \\ \vdots \\ \vec{\beta}_N \end{bmatrix}.$$

Now your distribution of outputs is $B \cdot \vec{x}_0$, and you can do statistics on the elements of that vector to present confidence intervals.
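
A minimal sketch of this idea, assuming `B` is the $N \times p$ matrix of bootstrapped coefficient vectors (one row per data set) and `x0` the new input point including the intercept term; the function name is illustrative:

```python
import numpy as np

def bootstrap_interval(B, x0, level=0.95):
    """Percentile interval of the predictions B @ x0 over all bootstrapped models."""
    y0_samples = B @ x0                       # one prediction per bootstrapped model
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(y0_samples, [alpha, 1.0 - alpha])
    return y0_samples.mean(), (lower, upper)
```

Note that this interval reflects only the coefficient uncertainty (a CI on the mean prediction); to approximate a prediction interval one would additionally add a residual-noise draw, i.e. the extra $s^2$ term in Eq. (2), to each element of `y0_samples`.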

Answered by Bobak Hashemi on November 7, 2020

This is a problem that is essentially tailor-made for Bayesian analysis. The output of a Bayesian analysis is the joint distribution of all of your model coefficients. So, you can simulate samples from the predicted data by first drawing a sample from the model coefficients and then using those model coefficients to draw a sample from the data. This is called the "posterior predictive distribution". It is commonly used in Bayesian analysis to evaluate the validity of the model. If your model reasonably approximates your data generation process then your actual data should be reasonably similar to your posterior predicted data.

I recommend using the rstanarm package in R. IMO, even if you don't know R it is worth learning it just to use this package.

https://mc-stan.org/rstanarm/
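
For illustration, here is a minimal sketch of posterior predictive sampling in plain numpy, assuming posterior draws of the coefficients and of the residual standard deviation are already available (e.g. from an MCMC run); the function and argument names are illustrative:

```python
import numpy as np

def posterior_predictive(beta_draws, sigma_draws, x0, rng=None):
    """Simulate the posterior predictive distribution of y at a new point x0.

    beta_draws  : (S, p) array of posterior samples of the coefficients
    sigma_draws : (S,)   array of posterior samples of the residual std. dev.
    """
    rng = rng or np.random.default_rng()
    mu = beta_draws @ x0                  # mean prediction for each posterior draw
    return rng.normal(mu, sigma_draws)    # add observation noise draw by draw
```

Quantiles of the returned samples give the prediction interval directly; in rstanarm the equivalent draws are obtained with `posterior_predict()` on a fitted model.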

Answered by Dale on November 7, 2020
