Different significant variables but same Adjusted R-squared value

Data Science · Asked by imguessing on March 21, 2021

I performed a multiple linear regression on 64 variables with 3 different models:

  1. Performed Multiple Linear Regression on all 64 variables
  2. Performed feature selection with a random forest, then ran multiple linear regression on the selected features
  3. Performed Stepwise Linear Regression

I got the same adjusted R-squared value for all 3 models, but different variables came out significant in each. How should I make sense of this? Which model should I go with?

Will appreciate any advice! Thank you!
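For reference, here is roughly what I did in R. The data frame `df` with response column `y` and the top-10 cutoff for the random-forest selection are just stand-ins:

```r
library(randomForest)

# 1. Multiple linear regression on all 64 variables
fit_all <- lm(y ~ ., data = df)

# 2. Random-forest importance to pick features, then linear regression
rf  <- randomForest(y ~ ., data = df, importance = TRUE)
imp <- importance(rf, type = 1)                            # %IncMSE per predictor
top <- rownames(imp)[order(imp, decreasing = TRUE)][1:10]  # e.g. top 10 features
fit_rf <- lm(reformulate(top, response = "y"), data = df)

# 3. Stepwise selection (both directions, AIC-based)
fit_step <- step(fit_all, direction = "both", trace = 0)

# Compare adjusted R-squared across the three models
sapply(list(all = fit_all, rf = fit_rf, step = fit_step),
       function(m) summary(m)$adj.r.squared)
```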

2 Answers

It seems that removing features did not really improve the model fit. The differences in which features come out significant are likely an artifact of which features were excluded: dropping a variable changes the coefficients, and hence the p-values, of the correlated variables that remain.

One thing you should try is regularization via lasso/ridge regression. In R, this is easily implemented with the glmnet package. Here is a tutorial. This is the best feature selection method imo, since there is a mathematical rationale behind it (see Introduction to Statistical Learning for more background).

Lasso can shrink coefficients exactly to zero (i.e., drop those features); ridge cannot. Just give it a try.
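A minimal glmnet sketch, again assuming a hypothetical data frame `df` with response `y`:

```r
library(glmnet)

# Build the predictor matrix; drop the intercept column
x <- model.matrix(y ~ ., data = df)[, -1]

# alpha = 1 is the lasso, alpha = 0 is ridge
cv_lasso <- cv.glmnet(x, df$y, alpha = 1)  # cross-validated choice of lambda
coef(cv_lasso, s = "lambda.min")           # features shrunk to zero show as "."
```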

Hint: In linear regression with continuous features you can also add polynomial terms to increase the fit, using the poly function. You could also see whether regression splines help you deal with hidden non-linearity. The gam package is a really good start. Here are the docs.
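For instance, a sketch with a polynomial term and a spline smooth, assuming a single continuous predictor `x1` in the hypothetical `df` (I use mgcv here, one common GAM implementation):

```r
# Cubic polynomial in x1 via orthogonal polynomial terms
fit_poly <- lm(y ~ poly(x1, 3), data = df)

library(mgcv)
fit_gam <- gam(y ~ s(x1), data = df)  # spline smooth of x1
summary(fit_gam)                      # check the effective df of the smooth
```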

The book Introduction to Statistical Learning covers these topics in a very good way and comes with useful R examples. Here is the code.

Answered by Peter on March 21, 2021

64 variables is a lot for linear regression, and I'd worry deeply about collinearity, interdependent variables, etc.

While a good basic assumption would be to go with the model with the fewest variables (adjusted R² being equal), I would urge you to go deeper here.

Have you performed factor analysis or PCA on the predictor variables? Would a simplified model using components or factors perform better and be more interpretable?
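A rough sketch of that idea, assuming the same hypothetical `df` with numeric predictors; the choice of 5 components is purely illustrative:

```r
# PCA on the predictors (centered and scaled)
pcs <- prcomp(df[, setdiff(names(df), "y")], scale. = TRUE)
summary(pcs)  # variance explained per component

# Regress on the leading components instead of the raw variables
df_pc  <- data.frame(y = df$y, pcs$x[, 1:5])
fit_pc <- lm(y ~ ., data = df_pc)
summary(fit_pc)$adj.r.squared
```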

Regression really isn't a good model to use if you just want to throw everything at it and see what sticks. Depending on the motivation behind your problem (as @Spacedman pointed out), I would try alternative models as well.

E.g. why use the RF only for feature selection, why not for the whole regression? If you aim for prediction, predictive quality rather than R² would be your main metric anyway, and you could try further algorithms such as XGBoost as well.
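For example, a sketch that uses the random forest itself as the regressor and judges it on held-out error rather than R² (the 80/20 split is illustrative):

```r
library(randomForest)

set.seed(42)
idx  <- sample(nrow(df), 0.8 * nrow(df))   # train/test split
rf   <- randomForest(y ~ ., data = df[idx, ])
pred <- predict(rf, newdata = df[-idx, ])
sqrt(mean((df$y[-idx] - pred)^2))          # test RMSE
```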

Answered by Fnguyen on March 21, 2021
