Data Science Asked by Srinath Ganesh on December 29, 2020
Note: I have already looked at Difference between RFE and SelectFromModel in Scikit-Learn post and my query is differnt from that post
Expectation:
SelectFromModel
and RFE
have similar/comparable performance in the model built using their recommendations.
Doubt: Is there any known usecase where RFE will fare better? As a student of data science (just starting learning) its a weird observation for me
Code:
# RecursiveFeatureElimination_ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesClassifier
rfe_selector = RFE(estimator=ExtraTreesClassifier(), n_features_to_select=20, step=10)
rfe_selector.fit(x_raw, y_raw)
[x[0] for x in pandas.Series(rfe_selector.support_, index=x_raw.columns.values).items() if x[1]]
# returns
['loan_amnt','funded_amnt','funded_amnt_inv','term','int_rate','installment','grade','sub_grade','dti','initial_list_status','out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp','total_rec_int','recoveries','collection_recovery_fee','last_pymnt_amnt','next_pymnt_d']
# SelectFromModel_ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100), max_features=20)
selector.fit(x_raw, y_raw)
[x[0] for x in pandas.Series(selector.get_support(), index=x_raw.columns.values).items() if x[1]]
# prints
['loan_amnt','funded_amnt','funded_amnt_inv','term','installment','out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp','total_rec_int','recoveries','collection_recovery_fee','last_pymnt_d','last_pymnt_amnt','next_pymnt_d']
Code for Model train and test
# internal code to select what variables I want
x_train, y_train, x_test, y_test = get_train_test(var_set_type=4)
model = ExtraTreesClassifier()
model.fit(x_train, y_train)
# then just print the confusion matrix
ExtraTreesClassifier Model from SelectFromModel variables
ExtraTreesClassifier Model from RFE variables
My confusion matrix is powered by this Open Source project https://github.com/DTrimarchi10/confusion_matrix
To start with, let me repeat here what I have already answered to another OP wondering if Recursive feature selection may not yield higher performance?:
There is simply no guarantee that any kind of feature selection (backward, forward, recursive - you name it) will actually lead to better performance in general. None at all. Such tools are there for convenience only - they may work, or they may not. Best guide and ultimate judge is always the experiment.
Apart from some very specific cases in linear or logistic regression, most notably the Lasso (which, no coincidence, actually comes from statistics), or somewhat extreme cases with too many features (aka The curse of dimensionality), even when it works (or doesn't), there is not necessarily much to explain as to why (or why not).
Having clarified that, let's see your case in more detail.
None of the feature selection procedures here takes into account the model performance; in classification settings, the sole criterion by which features are deemed as "important" or not is the mean decrease in the Gini impurity achieved by splitting in the respective feature; for some background, see the following threads (although they are about Random Forests, the rationale is identical):
Although it is often implicitly assumed that a reduction of the features using this importance as a criterion may lead to gains in the performance metric, this is by no means certain and far from straightforward (I am actually repeating my intro here).
Given that, it would seem that the actual question here should be why the two methods end up selecting different features, for which the thread you have linked yourself, Difference between RFE and SelectFromModel in Scikit-Learn, is arguably relevant. In any case, the expectation that they should offer similar results is arguably not well-founded; the relative feature importance changes when features are removed from the model (RFE case), hence it cannot be directly compared with the SelectFromModel
approach (use all features, remove those with importance below a threshold). Similarly, the question "Is there any known use case where RFE will fare better?" is ill-posed - you have not shown that RFE is consistently inferior, and the results of a single experiment with a single dataset and a single parameter setting (such as the no. of required features and the threshold involved in SelectFromModel
) should not be generalized light-heartedly.
In any case, selecting features based on their Gini importance (Mean Decrease in Impurity - MDI) has started falling out of fashion, mainly because the calculated importance is spurious in cases of categorical features with high cardinality (see the academic paper The revival of the Gini importance?); in fact, there is already a relevant warning in the scikit-learn classifiers inorporating this "classical" feature_importances
attribute:
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See
sklearn.inspection.permutation_importance
as an alternative.
See the scikit-learn vignette Permutation Importance vs Random Forest Feature Importance (MDI) for a concrete example.
Irrelevant to the exact question, and if the classes in your dataset are imbalanced (as they seem to be), you could (and should) inform your models about this, using the class_weight
argument of ExtraTreesClassifier (docs), i.e. change the model in both cases to
ExtraTreesClassifier(class_weight='balanced')
(h/t to Ben Reiniger for a constructive comment that helped to improve the answer)
Correct answer by desertnaut on December 29, 2020
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP