Cross Validated: asked by badmax on December 25, 2021
I was toying with R to see how the number of variables might affect spurious regression. Suppose that we have an $I(1)$ vector $y$ and a matrix $X$ with $I(1)$ columns. If the two are not related, then OLS regression will be disastrous, with up to 50% of $X$'s columns showing significance. On the other hand, suppose I set
$$y = X_1 \beta + \epsilon$$
where $X_1$ is the first column of the $X$ matrix and $\epsilon$ is white noise. Then the regression works beautifully: $y$ and $X_1$ form a cointegrating pair, and the regression correctly determines that the other columns are unrelated to the outcome, despite being nonstationary.
This raises the question: in situations where you have thousands or more variables and you would use regularized regression techniques, is spurious regression a problem? It seems that as long as there's at least one variable related to the outcome, your regression will be fine.
The code for my experiment:
nruns <- 1000   # number of simulated data sets
nobs  <- 1000   # observations per data set
nvars <- 100    # I(1) predictors per data set

significant_coefs <- numeric(nruns)
for (i in 1:nruns) {
  X <- replicate(nvars, cumsum(rnorm(nobs)))   # independent random walks
  y <- X[, 1] + rnorm(nobs, sd = 1000)         # cointegrated with the first column
  model <- lm(y ~ X)
  # count coefficients (including the intercept) with p-value <= 0.05
  significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
}
hist(significant_coefs)
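As a side check on the cointegration claim above, one can run an Engle-Granger style test: regress $y$ on $X_1$ and test whether the residuals are stationary. A minimal sketch, assuming the tseries package is available (its adf.test function runs an augmented Dickey-Fuller test):
library(tseries)   # for adf.test()

set.seed(1)
nobs <- 1000
x1 <- cumsum(rnorm(nobs))            # I(1) regressor
y  <- x1 + rnorm(nobs, sd = 1000)    # cointegrated with x1 by construction

adf.test(residuals(lm(y ~ x1)))      # small p-value: residuals look stationary

y_rw <- cumsum(rnorm(nobs))          # unrelated random walk
adf.test(residuals(lm(y_rw ~ x1)))   # large p-value expected: residuals stay I(1)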
To see the impact of spurious regression, just change the $y$ variable to a random walk.
nruns <- 1000
nobs  <- 1000
nvars <- 100

significant_coefs <- numeric(nruns)
for (i in 1:nruns) {
  X <- replicate(nvars, cumsum(rnorm(nobs)))   # independent random walks
  y <- cumsum(rnorm(nobs))                     # random walk unrelated to X
  model <- lm(y ~ X)
  significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
}
hist(significant_coefs)
In the first case I get an average of 6 coefficients with p-values below 0.05; in the second I get 51.
In the scenario you described, you would usually use cross-validation to tune the regularization parameter of the regression. Cross-validation will tell you whether the relationships you identified were spurious, since a spurious model will have poor performance on the validation sets (though the cross-validation estimate can have high variance). In that sense, spurious regression will not be a problem, because you will know that your model has no predictive power.
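For concreteness, here is a minimal sketch of what that looks like with the lasso, assuming the glmnet package is available (note that cv.glmnet uses random, not time-ordered, folds by default):
library(glmnet)   # for cv.glmnet()

set.seed(1)
nobs  <- 1000
nvars <- 100
X <- replicate(nvars, cumsum(rnorm(nobs)))   # unrelated I(1) predictors
y <- cumsum(rnorm(nobs))                     # unrelated I(1) outcome

cv_fit <- cv.glmnet(X, y)                    # 10-fold CV over the lambda path
plot(cv_fit)                                 # CV error vs. log(lambda)
coef(cv_fit, s = "lambda.1se")               # coefficients at the 1-SE lambda
Comparing the cross-validated error at the selected lambda with the error of the intercept-only fit (the right-hand end of the curve) shows how much predictive power, if any, the penalized model actually has.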
However, performing linear regression on non-stationary (non-i.i.d.) data is a bad idea. Your model will likely pick up spurious correlations, even though cross-validation will tell you that they are spurious. You should transform the data to stationarity before performing the regression, for example by first-differencing the $I(1)$ series. This allows the model to ignore irrelevant variables that merely share a stochastic trend with your output variable.
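A minimal sketch of that transformation, first-differencing the simulated I(1) series from the question before fitting:
set.seed(1)
nobs  <- 1000
nvars <- 100
X <- replicate(nvars, cumsum(rnorm(nobs)))
y <- cumsum(rnorm(nobs))                     # random walk unrelated to X

dX <- apply(X, 2, diff)                      # first differences of each column
dy <- diff(y)

model_diff <- lm(dy ~ dX)
sum(summary(model_diff)$coefficients[, 4] <= 0.05)   # roughly 5% false positives expected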
Answered by rinspy on December 25, 2021