Data Science Asked on June 30, 2021
I have a dataset with 11 features. I noticed that manipulating these features (e.g. dropping one or some of them) doesn't affect the error scores on the training and testing data, so I checked the importance of these features. Here is what I found:
As noticed, the first feature has a very high contribution, while the rest have insignificant importance. So I tried to run the model using only the first feature. I expected the scores not to decrease significantly, since the 10 dropped features have very low feature importance. However, after running the experiment with only the first feature, the absolute error percentage on the testing data increased significantly, from 14.13010% to 22.96036%. Why is this happening? I expected the error to stay close to the baseline testing results, since I train on the feature that dominates the feature importance.
Also, some of these features are correlated (no more than 0.62 correlation). Is this the reason why the scores aren't reliable? If so, what metric can I use to test feature importance for correlated features?
I can't give you a perfect answer because there is no code or dataset, and the target you want to achieve isn't stated.
This happens because the feature importances from a random forest are calculated on the training data given to the model, not on predictions on a test dataset. That means they do not reflect true predictive power. You should check whether there is a difference between training and test results when you run a random forest model. Another option is permutation feature importance.
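As an illustration of that difference, here is a minimal sketch comparing the impurity-based importances with permutation importance scored on held-out data (the synthetic dataset and hyperparameters are placeholders, not taken from the question):

```python
# Minimal sketch: impurity-based vs. permutation feature importance.
# The synthetic regression data below is a placeholder for the real dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=11, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Impurity-based importances: derived from the training data only.
print("impurity-based:", model.feature_importances_)

# Permutation importance: the drop in test-set score when one feature's
# values are shuffled, so it reflects predictive power on unseen data.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
print("permutation (test):", result.importances_mean)
```

Be aware that permutation importance can itself be misleading when features are strongly correlated: shuffling one feature barely hurts the score if a correlated feature carries the same information, so both can look unimportant even though the pair matters.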
I had a similar experience and solved it in a different way.
With these 4 options, I got a better view of my dataset. I hope this helps a bit.
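For the correlated-features part of the question, one commonly used approach (a sketch under my own assumptions; the clustering threshold and data are placeholders) is to cluster features by their Spearman rank correlation and keep a single representative per cluster before fitting and computing importances:

```python
# Minimal sketch: group correlated features by hierarchical clustering and
# keep one representative per cluster. Threshold and data are placeholders.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=11, noise=10, random_state=0)
rng = np.random.default_rng(0)
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=len(X))  # introduce correlation

# Distance = 1 - |Spearman correlation|: highly correlated features are close.
corr, _ = spearmanr(X)
dist = 1 - np.abs(corr)
linkage = hierarchy.ward(squareform(dist, checks=False))
cluster_ids = hierarchy.fcluster(linkage, t=0.5, criterion="distance")

# Keep the first feature of each cluster; refit the model on the reduced set.
selected = [int(np.where(cluster_ids == c)[0][0])
            for c in np.unique(cluster_ids)]
X_reduced = X[:, selected]
print("kept feature indices:", selected)
```

With one representative per correlated group, importance is no longer split across near-duplicates; that splitting may also explain why dropping individually "unimportant" features increased your test error.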
Answered by martin on June 30, 2021