Cross Validated Asked by StatCurious on February 18, 2021
I have data that I’ve collected from every element in my population of study (a census), not a sample. I am hoping to use the data to develop a regression model to answer questions such as: for a one-unit increase in $X$, what is the average increase I can expect in $Y$? I’m used to developing and interpreting regression models when I have a sample of data, in which case it makes sense to report the $p$-values associated with my estimated coefficients. However, in this case, would I just need to report the beta values from my model without the $p$-values, since there is no sampling variability? Say I developed a regression model that looked like this:
$Y=3+4.5X_1+2X_2$
Would I simply say that for a one-unit increase in $X_1$, I can expect $Y$ to increase by 4.5 units on average, holding all other variables constant (and then not report any $p$-values or confidence intervals, since this is census data without variability)? And how would this change, if at all, if I were interested in making a prediction about a future observation, given $X_1$ and $X_2$?
Note, I do not wish to infer anything beyond my population for which I’ve collected completed data (i.e., I don’t really think of my data as having been generated from some super-population and I’m not interested in making any inferences to this population in the future either).
I think this question becomes much easier if I were just interested in comparing, say, males and females in my population. For example, if I wanted to compare BMI for males with BMI for females, I’d simply calculate the average BMI for each group. If the average BMI for males was 29.2 and the average BMI for females was 28.4, then the true population difference in means would simply be 29.2-28.4=0.8, with no standard error. But in a multiple regression model this doesn’t seem so clear-cut to me. Would I want to use or report a standard error or prediction interval when predicting a future observation given $X_1$ and $X_2$? It seems weird to me to give a prediction for a future observation without a prediction interval.
Thanks.
Your intuition that it seems weird not to include measures of uncertainty in predictions is, I would say, correct. If your concern is solely in the sample you collected, then typically you wouldn't fit regression models. Means and averages are typically all that you would report. However, the fact that you are interested in regression models and in prediction in particular suggests that the greater "population" you wish to make inference on is not restricted to your sample. More specifically, it seems you wish to make inference in various "what-if" scenarios, e.g., what if participant 1's height was increased by 1 while every other attribute of his stays the same?
This falls more generally under the "causal inference" paradigm, which some would argue is at the heart of epidemiology. Now, methods for causal inference can be quite involved, but the simplest of these is the linear regression model, which we can write in equation form as $$ Y = \beta_0 + \beta_1 X_1 + \cdots + \epsilon. $$ Although some textbooks may interpret $\epsilon$ as sampling error, this is not necessary in general. $\epsilon$ can represent general "uncertainty" in a modeling exercise, where the goal is to derive a simple, parsimonious, mathematical relationship between the various variables. $\epsilon$ can also represent the effect of "all other variables you have not measured", which I think is the standard econometric interpretation. For your application, I think the former is more appropriate.
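To make the role of $\epsilon$ concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under stated assumptions: the data are simulated (not your census), the coefficients 3, 4.5 and 2 are borrowed from the hypothetical model in the question, and the helper name `fit_ols` is made up for this example. The rough 95% band further assumes that $\epsilon$ is approximately normal and treats the fitted coefficients as fixed, so it reflects only the spread of the residuals.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ...; returns coefficients and residual SD."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
    resid = y - A @ beta                           # the fitted "epsilon" for each unit
    dof = len(y) - A.shape[1]                      # observations minus fitted parameters
    sigma = np.sqrt(resid @ resid / dof)           # spread not explained by the model
    return beta, sigma

# Simulated stand-in for a census (assumption: not real data).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
eps = rng.normal(scale=1.5, size=n)                # unmeasured influences on Y
y = 3 + 4.5 * X[:, 0] + 2 * X[:, 1] + eps

beta, sigma = fit_ols(X, y)

# Prediction for a new unit, with a rough band driven by the residual spread only.
x_new = np.array([1.0, 2.0])
pred = beta[0] + beta[1:] @ x_new
lower, upper = pred - 1.96 * sigma, pred + 1.96 * sigma

# "What-if" scenario: raise X1 by one unit and hold X2 fixed.
x_shift = x_new + np.array([1.0, 0.0])
pred_shift = beta[0] + beta[1:] @ x_shift

print("coefficients:", beta)
print("prediction:", pred, "rough 95% band:", (lower, upper))
print("change in prediction for +1 in X1:", pred_shift - pred)
```

The change in the prediction for the what-if scenario equals the fitted coefficient on $X_1$, while the width of the band comes entirely from the residual spread. Even with every unit of the population observed, that spread does not vanish, which is why a prediction without some interval would be incomplete.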
If you can gain access, the following article is a good discussion of the mathematical basis of various epidemiological models.
Answered by Tim Mak on February 18, 2021