Cross Validated: asked by badmax on December 25, 2021
I was toying with R to see how the number of variables might affect spurious regression. Suppose that we have an $I(1)$ vector $y$ and a matrix $X$ with $I(1)$ columns. If the two are not related, then OLS regression will be disastrous, with up to 50% of $X$'s columns showing significance. On the other hand, suppose I set
$$y = X_1 \beta + \epsilon$$
where $X_1$ is the first column of the $X$ matrix and $\epsilon$ is white noise. Then the regression works beautifully: $y$ and $X_1$ form a cointegrating pair, and the regression correctly determines that the other columns are unrelated to the outcome, despite being nonstationary.
This raises the question: in situations where you have thousands or more variables and you would use regularized regression techniques, is spurious regression a problem? It seems that as long as there's at least one variable related to the outcome, your regression will be fine.
The code for my experiment:
nruns <- 1000   # number of simulated data sets
nobs  <- 1000   # observations per data set
nvars <- 100    # I(1) predictors per data set

significant_coefs <- numeric(nruns)
for (i in 1:nruns) {
  X <- replicate(nvars, cumsum(rnorm(nobs)))   # independent random walks
  y <- X[, 1] + rnorm(nobs, sd = 1000)         # cointegrated with the first column
  model <- lm(y ~ X)
  # count coefficients (including the intercept) with p-value <= 0.05
  significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
}
hist(significant_coefs)
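As a side check on the cointegration claim above, one can run an Engle-Granger style test: regress $y$ on $X_1$ and test whether the residuals are stationary. A minimal sketch, assuming the tseries package is available (its adf.test function runs an augmented Dickey-Fuller test):
library(tseries)   # for adf.test()

set.seed(1)
nobs <- 1000
x1 <- cumsum(rnorm(nobs))            # I(1) regressor
y  <- x1 + rnorm(nobs, sd = 1000)    # cointegrated with x1 by construction

adf.test(residuals(lm(y ~ x1)))      # small p-value: residuals look stationary

y_rw <- cumsum(rnorm(nobs))          # unrelated random walk
adf.test(residuals(lm(y_rw ~ x1)))   # large p-value expected: residuals stay I(1)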
To see the impact of spurious regression, just change the $y$ variable to a random walk.
nruns <- 1000
nobs  <- 1000
nvars <- 100

significant_coefs <- numeric(nruns)
for (i in 1:nruns) {
  X <- replicate(nvars, cumsum(rnorm(nobs)))   # independent random walks
  y <- cumsum(rnorm(nobs))                     # random walk unrelated to X
  model <- lm(y ~ X)
  significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
}
hist(significant_coefs)
In the first case I get an average of 6 coefficients with p-values below 0.05; in the second I get 51.
In the scenario you described, you would usually use cross-validation to tune the regularization parameter of the regression. Cross-validation will tell you whether the relationships you identified were spurious, since a spurious model will have poor performance on the validation sets (though the cross-validation estimate can have high variance). In that sense, spurious regression will not be a problem, because you will know that your model has no predictive power.
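For concreteness, here is a minimal sketch of what that looks like with the lasso, assuming the glmnet package is available (note that cv.glmnet uses random, not time-ordered, folds by default):
library(glmnet)   # for cv.glmnet()

set.seed(1)
nobs  <- 1000
nvars <- 100
X <- replicate(nvars, cumsum(rnorm(nobs)))   # unrelated I(1) predictors
y <- cumsum(rnorm(nobs))                     # unrelated I(1) outcome

cv_fit <- cv.glmnet(X, y)                    # 10-fold CV over the lambda path
plot(cv_fit)                                 # CV error vs. log(lambda)
coef(cv_fit, s = "lambda.1se")               # coefficients at the 1-SE lambda
Comparing the cross-validated error at the selected lambda with the error of the intercept-only fit (the right-hand end of the curve) shows how much predictive power, if any, the penalized model actually has.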
However, performing linear regression on non-stationary (non-i.i.d.) data is a bad idea. Your model will likely pick up spurious correlations, even though cross-validation will tell you that they are spurious. You should transform the data to stationarity before performing the regression, for example by first-differencing the $I(1)$ series. This allows the model to ignore irrelevant variables that merely share a stochastic trend with your output variable.
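A minimal sketch of that transformation, first-differencing the simulated I(1) series from the question before fitting:
set.seed(1)
nobs  <- 1000
nvars <- 100
X <- replicate(nvars, cumsum(rnorm(nobs)))
y <- cumsum(rnorm(nobs))                     # random walk unrelated to X

dX <- apply(X, 2, diff)                      # first differences of each column
dy <- diff(y)

model_diff <- lm(dy ~ dX)
sum(summary(model_diff)$coefficients[, 4] <= 0.05)   # roughly 5% false positives expected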
Answered by rinspy on December 25, 2021