Data Science: asked on September 5, 2020
I have read in comments a recommendation to use decision trees instead of linear models or neural networks when the dataset has many correlated features, the reasoning being that this avoids multicollinearity.
A similar question has already been asked, but not really answered:
https://stats.stackexchange.com/questions/137573/do-classification-trees-need-to-consider-the-correlation-between-attributes
or here
In supervised learning, why is it bad to have correlated features?
My problem:
I have a dataset with about 30 columns. 10 of the columns are highly correlated with the target (dependent) variable, and all of the data are numerical. I would like to build a prediction (regression) model, including all variables if possible.
One big problem is avoiding multicollinearity.
To answer your questions directly, first:
Is a decision tree regression model good when 10 features are highly correlated?
Yes, definitely. But even better than a single decision tree is an ensemble of many decision trees (Random Forests, or gradient boosting, of which XGBoost is a popular implementation). I think you'd be well served by learning how decision trees split and how they naturally deal with collinearity. Maybe try this video. Follow the logic down to the second tier of splits, and you'll see why correlated variables suddenly stop mattering: each split is evaluated relative to the split above it, as illustrated in the sketch below.
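As a rough illustration, here is a minimal sketch using scikit-learn on a synthetic dataset that stands in for your 30 columns (the data, column counts, and model settings are my assumptions, not from your question). A tree ensemble fits correlated numeric features without any special treatment:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in: 30 numeric columns, 10 of which are
# highly correlated noisy copies of the same underlying signal.
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
correlated = signal[:, None] + 0.1 * rng.normal(size=(n, 10))  # 10 correlated features
noise_cols = rng.normal(size=(n, 20))                           # 20 unrelated features
X = np.hstack([correlated, noise_cols])
y = 3 * signal + rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of decision trees: splits are chosen greedily node by node,
# so the correlated columns simply share the split points between them.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```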
Is there a scientific or mathematical explanation or recommendation (to use decision tree regression)?
The mathematical explanation of why collinearity is "bad" for linear models comes down to the coefficients and how you interpret them. One of its side effects is that it can undermine the statistical significance of a variable and flip its coefficient in the wrong direction. It usually doesn't affect the accuracy of the model very much, but most people choose linear models precisely so they can interpret the coefficients, and that interpretation is exactly what collinearity ruins. I suggest reading this article to start.
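You can see this effect in a small sketch (synthetic data and scikit-learn again, purely as an assumed setup): adding a near-duplicate predictor barely changes the fit, but it can make the coefficients large, unstable, and hard to interpret.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # nearly identical to x1 -> strong collinearity
y = 2 * x1 + rng.normal(scale=0.1, size=n)

# Fit with only x1: the coefficient lands close to the true value of 2.
print(LinearRegression().fit(x1.reshape(-1, 1), y).coef_)

# Fit with both x1 and x2: the coefficients become unstable and may even
# flip sign, while the predictions (and R^2) stay almost the same.
print(LinearRegression().fit(np.column_stack([x1, x2]), y).coef_)
```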
One of the things you mentioned, "include all variables if possible", is not really something you should be concerned with. The goal of a model is to explain the most with the least. If you force as many variables as possible into the model, you may be fooled into thinking the model is good when in fact it would not hold up on new data. Sometimes fewer variables give you a better model; a quick way to check is sketched below. This is exactly the kind of problem multicollinearity causes with linear models: you can't judge very well which variables are significant and which are not, and stepwise selection doesn't work well when features are correlated.
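One simple, hedged way to test "all variables" against "fewer variables" is to compare the two with cross-validation. The data and the choice of SelectKBest with f_regression below are illustrative assumptions, not a prescription:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 30))
y = X[:, :5].sum(axis=1) + rng.normal(size=500)   # only a few columns actually matter

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Cross-validated R^2 with all 30 columns vs. the 10 most target-related columns.
score_all = cross_val_score(model, X, y, cv=5).mean()
X_top = SelectKBest(f_regression, k=10).fit_transform(X, y)
score_top = cross_val_score(model, X_top, y, cv=5).mean()
print(f"all features: {score_all:.3f}   top-10 features: {score_top:.3f}")
```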
In general, I think decision trees, especially Random Forests, will be a good start for you. But remember not to force all of the variables into the model just for the sake of it. Experiment with using fewer variables and with the tree structure itself, such as leaf size and maximum depth. And, as always, test your model on validation data and holdout data so that you don't overfit and fool yourself into thinking you have a strong model.
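A minimal sketch of that workflow, assuming scikit-learn and the same synthetic data as above: tune depth and leaf size with cross-validation on the training portion, then score once on a holdout set the search never saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 30))
y = X[:, :5].sum(axis=1) + rng.normal(size=500)

# Keep a holdout set that the hyperparameter search never touches.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune the tree structure (max depth, minimum leaf size) via cross-validation.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid={"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
    cv=5,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("holdout R^2:", search.best_estimator_.score(X_hold, y_hold))
```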
Correct answer by Josh on September 5, 2020