Cross Validated Asked on February 2, 2021
It is shown in the answer here and elsewhere that the difference between two random variables is correlated with the baseline value. Hence the baseline should not be used as a predictor of change in regression equations. This can be checked with the R code below:
> N=200
> x1 <- rnorm(N, 50, 10)
> x2 <- rnorm(N, 50, 10)
> change = x2 - x1
> summary(lm(change ~ x1))
Call:
lm(formula = change ~ x1)
Residuals:
Min 1Q Median 3Q Max
-28.3658 -8.5504 -0.3778 7.9728 27.5865
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.78524 3.67257 13.83 <0.0000000000000002 ***
x1 -1.03594 0.07241 -14.31 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.93 on 198 degrees of freedom
Multiple R-squared: 0.5083, Adjusted R-squared: 0.5058
F-statistic: 204.7 on 1 and 198 DF, p-value: < 0.00000000000000022
The plot between x1 (baseline) and change shows an inverse relation:
However, in many studies (especially, biomedical) baseline is kept as a covariate with change as outcome. This is because intuitively it is thought that change brought about by effective interventions may or may not be related to baseline level. Hence, they are kept in regression equation.
I have the following questions in this regard:
Is there any mathematical proof showing that changes (random or those caused by effective interventions) always correlate with baseline? Does it occur only in some circumstances or is it a universal phenomenon? Is distribution of data related to this?
Also, does keeping baseline as one predictor of change affect the results for other predictors that have no interaction with baseline? For example, in the regression equation change ~ baseline + age + gender, will the results for age and gender be invalid?
Is there any way to correct for this effect, if there is a biological reason to think that change may be DIRECTLY related to baseline (quite common in biological systems)?
Thanks for your insight.
Edit: I probably should have labelled x1 and x2 as y1 and y2, since we are discussing the response.
Some links on this subject:
Difference between Repeated measures ANOVA, ANCOVA and Linear mixed effects model
What are the worst (commonly adopted) ideas/principles in statistics?
- Is there any mathematical proof showing that changes (random or those caused by effective interventions) always correlate with baseline? Does it occur only in some circumstances or is it a universal phenomenon? Is distribution of data related to this?
We are interested in the covariance of $X$ and $X-Y$ where $X$ and $Y$ may not be independent:
$$
\begin{align*}
\text{Cov}(X,X-Y) &= \mathbb{E}[X(X-Y)]-\mathbb{E}[X]\mathbb{E}[X-Y] \\
&= \mathbb{E}[X^2-XY]-(\mathbb{E}[X])^2 + \mathbb{E}[X]\mathbb{E}[Y] \\
&= \mathbb{E}[X^2]-\mathbb{E}[XY]-(\mathbb{E}[X])^2 + \mathbb{E}[X]\mathbb{E}[Y] \\
&= \text{Var}(X)-\mathbb{E}[XY] + \mathbb{E}[X]\mathbb{E}[Y] \\
&= \text{Var}(X) - \text{Cov}(X,Y)
\end{align*}
$$
So yes, this is a problem in general: the covariance is zero only in the special case where $\text{Cov}(X,Y)=\text{Var}(X)$.
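As a quick numerical check of the identity above, the following R sketch simulates correlated baseline and follow-up values and compares the sample covariance of X with X - Y against Var(X) - Cov(X, Y). The seed, sample size, and the coefficient 0.6 are arbitrary choices for illustration, not values from the question.

```r
set.seed(1)                           # arbitrary seed, for reproducibility only
N <- 1000
x <- rnorm(N, 50, 10)                 # baseline
y <- 20 + 0.6 * x + rnorm(N, 0, 8)    # follow-up, correlated with baseline (assumed relationship)

lhs <- cov(x, x - y)                  # sample Cov(X, X - Y)
rhs <- var(x) - cov(x, y)             # sample Var(X) - Cov(X, Y)
all.equal(lhs, rhs)                   # the identity also holds for sample moments

# Regression slope of (x - y) on baseline equals Cov(X, X - Y) / Var(X)
slope <- unname(coef(lm(I(x - y) ~ x))["x"])
all.equal(slope, rhs / var(x))
```

When x and y are independent, as in the question, Cov(X, Y) is about 0, so this slope is about 1 (or about -1 when change is defined as x2 - x1, which is consistent with the estimate of -1.04 in the question's output).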
- Also, does keeping baseline as one predictor of change affect the results for other predictors that have no interaction with baseline? For example, in the regression equation change ~ baseline + age + gender, will the results for age and gender be invalid?
The whole analysis is invalid. The estimate for age is the expected association of age with change while keeping baseline constant. Maybe you can make sense of that, and maybe it does make sense, but you are fitting a model in which you invoke a spurious association (or distort an actual association), so don't do it.
- Is there any way to correct for this effect, if there is a biological reason to think that change may be DIRECTLY related to baseline (quite common in biological systems)?
Yes, this is very common, as you say. Fit a multilevel model (mixed effects model) with 2 time points per participant (baseline and follow-up), coded as -1 and +1. If you want to allow for differential treatment effects, you can fit random slopes too.
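A minimal sketch of such a model is shown below, using simulated data and the nlme package. The choice of nlme (rather than, say, lme4), the variable names, and the simulated effect sizes are assumptions made for illustration. Only a random intercept is fitted here.

```r
library(nlme)   # assumption: nlme is installed (it ships with standard R distributions)

set.seed(2)
n <- 100
subject <- rnorm(n, 50, 10)           # stable per-participant level (simulated)
y1 <- subject + rnorm(n, 0, 5)        # baseline
y2 <- subject + 5 + rnorm(n, 0, 5)    # follow-up, with an assumed mean change of 5

# Long format: one row per participant per time point, time coded -1 / +1
long <- data.frame(
  id   = factor(rep(seq_len(n), times = 2)),
  time = rep(c(-1, 1), each = n),
  y    = c(y1, y2)
)

# Random intercept per participant; fixed effect of time
fit <- lme(y ~ time, random = ~ 1 | id, data = long)
fixef(fit)   # with -1/+1 coding, the time coefficient is half the mean change
```

Both measurements are treated as outcomes here, so baseline never appears on the right-hand side. Covariates such as age could then enter through interactions with time (for example `y ~ time * age`, a hypothetical extension not fitted above).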
An alternative is Oldham's method, but that also has its drawbacks.
See Tu and Gilthorpe (2007) "Revisiting the relation between change and initial value: a review and evaluation" https://pubmed.ncbi.nlm.nih.gov/16526009
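As a rough illustration of Oldham's method: instead of correlating change with baseline, change is correlated with the average of the baseline and follow-up values. The sketch below uses simulated independent measurements with equal variance, as in the question; the seed and variable names are illustrative.

```r
set.seed(3)
N <- 200
x1 <- rnorm(N, 50, 10)      # baseline
x2 <- rnorm(N, 50, 10)      # follow-up, independent of baseline
change <- x2 - x1
avg <- (x1 + x2) / 2        # Oldham: use the average instead of the baseline

cor(change, x1)             # strongly negative, as in the question
cor(change, avg)            # close to zero when the two variances are equal

# The covariance with the average depends only on the difference in variances
all.equal(cov(change, avg), (var(x2) - var(x1)) / 2)
```

Because the covariance equals (Var(x2) - Var(x1)) / 2, a nonzero correlation under this method reflects a change in variance between the two occasions.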
Answered by Robert Long on February 2, 2021
Consider an agricultural experiment with yield as the response variable and fertilizers as the explanatory variables. In each field, one fertilizer (or none) is applied. Consider the following scenarios:
(1) There are three fertilizers, say n, p, k. For each of them we can include an effect in our linear model, and take our model as $$y_{ij} = \alpha_i + \varepsilon_{ij}.$$ Here $\alpha_i$ is interpreted as the effect of the $i$-th fertilizer.
(2) There are 2 fertilizers (say p, k), and on some of the fields no fertilizer has been applied (this is like a placebo in medical experiments). Here it is more intuitive to set the no-fertilizer effect as the baseline and take the model as $$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$ where $\mu$ accounts for the no-fertilizer effect, $\alpha_1 = 0$, and $\alpha_2, \alpha_3$ are interpreted as the "extra" effects of the fertilizers p, k.
Thus, when it seems appropriate to take a baseline, other effects are considered as the "extra" effect of that explanatory variable. Of course we can take a baseline for scenario (1) as well: define $\mu$ as the overall effect and $\alpha_i$ as the extra effect of the $i$-th fertilizer.
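Scenario (2) corresponds to what R does by default for a factor (treatment contrasts): the first level acts as the baseline, and the other coefficients are differences from it. The sketch below uses simulated yields; the group means and sample sizes are made up for illustration.

```r
set.seed(4)
# Simulated yields for 20 fields per group; "none" is set as the first (baseline) level
fert <- factor(rep(c("none", "p", "k"), each = 20), levels = c("none", "p", "k"))
yield <- c(rnorm(20, 30, 3), rnorm(20, 34, 3), rnorm(20, 37, 3))

fit <- lm(yield ~ fert)     # default treatment contrasts: "none" is the reference
coef(fit)                   # (Intercept) plays the role of mu; fertp, fertk are the "extra" effects

group_means <- tapply(yield, fert, mean)
group_means - group_means["none"]   # differences from baseline match fertp and fertk
```

Here the intercept equals the mean yield of the unfertilized fields, and each remaining coefficient is the difference between a fertilizer's mean yield and that baseline.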
In medical experiments, we sometimes come across a similar scenario. We set a baseline for the overall effect and define the coefficients as the "extra effects". When we use such a baseline, we no longer assume that the marginal effects are independent. Rather, we assume that the overall effect and the extra effects are independent. Such model assumptions mainly come from field experience, not from a mathematical point of view.
For your example (mentioned in the comments below), where $y_1$ is the height at the beginning and $y_2$ is the height after 3 months, after applying fertilizer, we can indeed have $y_2 - y_1$ as our response and $y_1$ as our predictor. But my point is that in most cases we won't assume $y_1$ and $y_2$ to be independent (that would be unrealistic, because you have applied a fertilizer to the plant with height $y_1$ to get $y_2$). When $y_1$ and $y_2$ are independent, theory tells you that $y_2 - y_1$ and $y_1$ are negatively correlated. But that is not the case here. In fact, in many cases you will find that $y_2-y_1$ is positively correlated with $y_1$, indicating that the greater the initial height, the more the fertilizer increases it, i.e., the more effective it becomes.
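The contrast between the two cases can be sketched with simulated heights. The growth model `1.5 * y1 + noise` is an assumption chosen only so that Cov(y1, y2) exceeds Var(y1); it is not taken from any real experiment.

```r
set.seed(5)
N <- 200
y1 <- rnorm(N, 50, 10)                  # height at baseline
y2_indep <- rnorm(N, 60, 10)            # follow-up independent of baseline
y2_prop  <- 1.5 * y1 + rnorm(N, 0, 5)   # assumed: growth proportional to baseline height

cor(y2_indep - y1, y1)   # negative: the independent case from the question
cor(y2_prop - y1, y1)    # positive: Cov(y1, y2) > Var(y1)
```

The sign of the correlation between change and baseline is therefore determined by whether Cov(y1, y2) is smaller or larger than Var(y1).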
Answered by Aditya Ghosh on February 2, 2021