Different significant variables but same Adjusted R-squared value

Data Science · Asked by imguessing on March 21, 2021

I performed a multiple linear regression on 64 variables with 3 different models:

  1. Performed Multiple Linear Regression on all 64 variables
  2. Performed feature selection with a random forest, then ran multiple linear regression on the selected features
  3. Performed Stepwise Linear Regression

I got the same adjusted R-squared value for all 3 models, but different variables came out significant in each. How should I make sense of this? Which model should I go with?

Will appreciate any advice! Thank you!
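For reference, here is roughly what I did in R. The data frame `df` with response column `y` and the top-10 cutoff for the random-forest selection are just stand-ins:

```r
library(randomForest)

# 1. Multiple linear regression on all 64 variables
fit_all <- lm(y ~ ., data = df)

# 2. Random-forest importance to pick features, then linear regression
rf  <- randomForest(y ~ ., data = df, importance = TRUE)
imp <- importance(rf, type = 1)                            # %IncMSE per predictor
top <- rownames(imp)[order(imp, decreasing = TRUE)][1:10]  # e.g. top 10 features
fit_rf <- lm(reformulate(top, response = "y"), data = df)

# 3. Stepwise selection (both directions, AIC-based)
fit_step <- step(fit_all, direction = "both", trace = 0)

# Compare adjusted R-squared across the three models
sapply(list(all = fit_all, rf = fit_rf, step = fit_step),
       function(m) summary(m)$adj.r.squared)
```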

2 Answers

It seems that removing features did not really improve the model fit. The differences in which features come out significant are likely an artifact of which features were excluded: dropping a variable changes the coefficients, and hence the p-values, of the correlated variables that remain.

One thing you should try is regularization via lasso/ridge regression. In R, this is easily implemented with the glmnet package. Here is a tutorial. This is the best feature selection method imo, since there is a mathematical rationale behind it (see Introduction to Statistical Learning for more background).

Lasso can shrink coefficients exactly to zero (i.e., drop those features); ridge cannot. Just give it a try.
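A minimal glmnet sketch, again assuming a hypothetical data frame `df` with response `y`:

```r
library(glmnet)

# Build the predictor matrix; drop the intercept column
x <- model.matrix(y ~ ., data = df)[, -1]

# alpha = 1 is the lasso, alpha = 0 is ridge
cv_lasso <- cv.glmnet(x, df$y, alpha = 1)  # cross-validated choice of lambda
coef(cv_lasso, s = "lambda.min")           # features shrunk to zero show as "."
```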

Hint: In linear regression with continuous features you can also add polynomial terms to increase the fit, using the poly function. You could also see whether regression splines help you deal with hidden non-linearity. The gam package is a really good start. Here are the docs.
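For instance, a sketch with a polynomial term and a spline smooth, assuming a single continuous predictor `x1` in the hypothetical `df` (I use mgcv here, one common GAM implementation):

```r
# Cubic polynomial in x1 via orthogonal polynomial terms
fit_poly <- lm(y ~ poly(x1, 3), data = df)

library(mgcv)
fit_gam <- gam(y ~ s(x1), data = df)  # spline smooth of x1
summary(fit_gam)                      # check the effective df of the smooth
```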

The book Introduction to Statistical Learning covers these topics in a very good way and comes with useful R examples. Here is the code.

Answered by Peter on March 21, 2021

64 variables is a lot for linear regression, and I'd worry deeply about collinearity, interdependent variables, etc.

While a good basic assumption would be to go with the model with the fewest variables (adjusted R² being equal), I would urge you to go deeper here.

Have you performed factor analysis or PCA on the predictor variables? Would a simplified model using components or factors perform better and be more interpretable?
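A rough sketch of that idea, assuming the same hypothetical `df` with numeric predictors; the choice of 5 components is purely illustrative:

```r
# PCA on the predictors (centered and scaled)
pcs <- prcomp(df[, setdiff(names(df), "y")], scale. = TRUE)
summary(pcs)  # variance explained per component

# Regress on the leading components instead of the raw variables
df_pc  <- data.frame(y = df$y, pcs$x[, 1:5])
fit_pc <- lm(y ~ ., data = df_pc)
summary(fit_pc)$adj.r.squared
```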

Regression really isn't a good model to use if you just want to throw everything at it and see what sticks. Depending on the motivation behind your problem (as @Spacedman pointed out), I would try alternative models as well.

E.g. why use the RF only for feature selection, why not for the whole regression? If you aim for prediction, predictive quality rather than R² would be your main metric anyway, and you could try further algorithms such as XGBoost as well.
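For example, a sketch that uses the random forest itself as the regressor and judges it on held-out error rather than R² (the 80/20 split is illustrative):

```r
library(randomForest)

set.seed(42)
idx  <- sample(nrow(df), 0.8 * nrow(df))   # train/test split
rf   <- randomForest(y ~ ., data = df[idx, ])
pred <- predict(rf, newdata = df[-idx, ])
sqrt(mean((df$y[-idx] - pred)^2))          # test RMSE
```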

Answered by Fnguyen on March 21, 2021
