Data Science Asked on September 1, 2021
I’m a beginner in machine learning, and I’ve read that collinearity among the predictor variables of a model is a big problem since it can lead to unpredictable model behaviour and large errors. But are there some models (say, GLMs) that are perhaps ‘okay’ with collinearity, unlike classic linear regression? It is said that classic linear regression assumes there is no correlation between its independent variables.
This question arises because I was doing a project that said "If the input features are correlated with each other, it is better to use a generalized linear model since they would perform better than linear regression."
Can someone explain this?
“It is said that the classic linear regression assumes there is no correlation between its independent variables”
Depending on your goal in running the regression, this common statement is false.
Even with multicollinearity, $\hat{\beta}=(X^TX)^{-1}X^Ty$ is still the solution to the OLS optimization problem.
Even with multicollinearity, $\hat{\beta}=(X^TX)^{-1}X^Ty$ is still the minimum-variance linear unbiased estimator by the Gauss-Markov theorem.
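As a minimal sketch of this point (with simulated data and variable names of my own choosing), the normal-equation solution on strongly correlated predictors still matches a standard least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two strongly correlated predictors.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed with solve() rather than an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer as a standard least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
```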
What the Gauss-Markov theorem requires is that the error terms be uncorrelated. This is commonly confused with a requirement that the predictors be uncorrelated, but that is a mistake.
There can be numerical instability when you do the math on a computer, particularly as you approach perfect multicollinearity ($X^TX$ becomes close to singular, and exactly singular in the extreme case of perfect multicollinearity, i.e., a correlation of $1$ between variables), but there is no inherent problem with multicollinearity if your goal is to predict.
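A quick illustration of the numerical side (assumed simulation, not anything from the original question): as the correlation between two predictors approaches $1$, the condition number of $X^TX$ blows up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)

for rho in [0.5, 0.9, 0.99, 0.999, 0.9999]:
    # Construct x2 with correlation roughly rho to x1.
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    print(f"rho={rho}: cond(X^T X) = {np.linalg.cond(X.T @ X):.1e}")
```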
Where multicollinearity can hurt is when you want to do inference on the parameters, which is rarely a goal in machine learning. When you have multicollinearity, the parameter standard errors are inflated, sapping away your power to tell that they are not zero. Philosophically, it also makes it hard to attribute an effect to a particular predictor if it is correlated with others. (Imagine a hospital wanting to know if it pays its neurosurgeons as much as its heart surgeons and sees that the heart surgeons make way more but also sees that the heart surgeons have much more experience. Do they make more because of their specialty or because of their experience?)
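To see the standard-error inflation concretely, here is a hedged sketch using the classical OLS standard-error formula on simulated data (the `coef_se` helper is my own, purely for illustration):

```python
import numpy as np

def coef_se(X, y):
    # Classical OLS standard errors: sqrt of the diagonal of sigma^2 (X^T X)^{-1}.
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    return np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)

for rho in [0.0, 0.95]:
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    y = 2.0 + x1 + x2 + rng.normal(size=n)
    print(f"rho={rho}: coefficient standard errors = {coef_se(X, y)}")
```

With `rho=0.95` the standard errors on the two slope coefficients come out several times larger than with `rho=0.0`, even though the fitted values are essentially as good.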
Multicollinearity can also mean that you might be able to use a smaller number of variables to get almost as much information as the entire set of variables. For instance, if two predictors are highly correlated, it might not be worth including both; you might be better off leaving one out for the sake of model parsimony and having fewer parameters in your regression, but that is an empirical question and up to the model-designer’s judgment.
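A small sketch of the parsimony point, again on hypothetical simulated data: with two highly correlated predictors, dropping one loses very little predictive information.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + np.sqrt(1 - 0.98**2) * rng.normal(size=n)
y = 2.0 + x1 + x2 + rng.normal(size=n)

def r_squared(X, y):
    # In-sample R^2 of an OLS fit.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean())**2)

X_full = np.column_stack([np.ones(n), x1, x2])
X_reduced = np.column_stack([np.ones(n), x1])
print("R^2 with both predictors:", r_squared(X_full, y))
print("R^2 with only x1:        ", r_squared(X_reduced, y))
```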
Moving to the full GLM framework, the Gauss-Markov theorem no longer applies, but the idea remains: there is no inherent issue with multicollinearity when your goal is to predict rather than to do parameter inference, and prediction is the typical goal in machine learning.
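For a GLM example of the same idea (a sketch using sklearn on simulated data, not anything specific to the original project): a logistic regression fit on strongly correlated features still predicts well on held-out data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
X = np.column_stack([x1, x2])

# True logistic model using both (correlated) features.
p = 1 / (1 + np.exp(-(0.5 + x1 + x2)))
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```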
Answered by Dave on September 1, 2021