Data Science Asked by Vikrant Arora on December 9, 2020
I want to know how L1 & L2 regularization works in Light GBM and how to interpret the feature importances.
The scenario is: I used LGBMRegressor with RandomizedSearchCV (cv=3, n_iter=50) on a dataset of 400,000 observations and 160 variables. In order to avoid overfitting / regularize the model, I provided the ranges below for the alpha/L1 and lambda/L2 parameters, and the best values according to the randomized search are reg_lambda=1 and reg_alpha=0.5 (a sketch of this setup follows the parameter lists below).
‘reg_lambda’: [0.5, 1, 3, 5, 10]
‘reg_alpha’: [0.5, 1, 3, 5, 10]
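For reference, here is a minimal sketch of such a search setup. Everything outside the two parameter lists (the random_state, the commented-out data names X_train/y_train) is a placeholder assumption, and the real search presumably tuned more parameters than shown:

```python
# Minimal sketch of the randomized search described above.
# Only the reg_lambda / reg_alpha lists come from the post; the rest is illustrative.
from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "reg_lambda": [0.5, 1, 3, 5, 10],  # L2 penalty
    "reg_alpha": [0.5, 1, 3, 5, 10],   # L1 penalty
}

search = RandomizedSearchCV(
    estimator=LGBMRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,   # with only these two lists there are 25 combinations, so sklearn will cap this
    cv=3,
    random_state=42,
)
# search.fit(X_train, y_train)   # X_train / y_train: the 400k x 160 dataset
# print(search.best_params_)     # e.g. {'reg_lambda': 1, 'reg_alpha': 0.5}
```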
Now my question: the feature importance values with the optimized reg_lambda=1 and reg_alpha=0.5 are very different from those obtained without providing any input for reg_lambda and reg_alpha. The regularized model considers only the top 5-6 features important and makes the importance values of the other features as good as zero (refer to the images). Is that normal behaviour of L1/L2 regularization in LightGBM?
To further explain the LGBM output with L1/L2: the top 5 important features are the same in both cases (with/without regularization); however, the importance values after the top 2 features have been shrunk significantly by the L1/L2 regularized model, and beyond the top 5 features the regularized model makes the importance values as good as zero (refer to the images of feature importance values for both cases).
Another related question I have: how should I interpret the importance values, and when I run the LGBM model with the best parameters from the randomized search, do I need to remove the features with low importance values and then rerun the model? Or should I run it with all the features, so that the LGBM algorithm (with L1 & L2 regularization) takes care of low-importance features and gives them no weight, or perhaps only a minute weight, when it makes predictions?
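To make the comparison concrete, here is a hedged sketch of how the two importance lists could be put side by side. The data names (X_train, y_train) and the parameter values are assumptions for illustration, not taken from the actual experiment:

```python
# Sketch: compare feature importances with and without L1/L2 regularization.
# X_train / y_train are assumed to be a pandas DataFrame / Series.
import pandas as pd
from lightgbm import LGBMRegressor

model_plain = LGBMRegressor(random_state=42)
model_reg = LGBMRegressor(reg_alpha=0.5, reg_lambda=1, random_state=42)
# model_plain.fit(X_train, y_train)
# model_reg.fit(X_train, y_train)

# feature_importances_ reports split counts by default;
# pass importance_type="gain" to the constructor for gain-based importance.
# importances = pd.DataFrame({
#     "feature": X_train.columns,
#     "no_regularization": model_plain.feature_importances_,
#     "with_regularization": model_reg.feature_importances_,
# }).sort_values("with_regularization", ascending=False)
# print(importances.head(10))
```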
Any help will be highly appreciated.
Regards
Vikrant
With regularization, LightGBM "shrinks" features which are not "helpful". So it is in fact normal that feature importance is quite different with and without regularization. You don't need to exclude any features, since the purpose of shrinking is to use features according to their importance (this happens automatically).
In your case the top two features seem to have good explanatory power, so that they are used as "most important" features. Other features are less important and are therefore "shrunken" by the model.
You may also find that different features pop up at the top of the list (and that the list looks different in general) when you run the model multiple times. This is because, if you don't fix a seed, the model will take different paths to obtain a best fit, so the whole procedure is not deterministic.
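For example, a minimal sketch of pinning the seed (the value 42 is arbitrary):

```python
# Sketch: fix the seed so repeated runs give the same importance ranking.
from lightgbm import LGBMRegressor

model = LGBMRegressor(
    random_state=42,        # fixes LightGBM's internal randomness
    # deterministic=True,   # optional in recent LightGBM versions; fully deterministic at some speed cost
)
```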
Overall you should get a better fit with regularization (otherwise there is little need for it).
I wonder if it makes sense to use both (L1 and L2)? L1 (aka reg_alpha) can shrink feature weights to zero while L2 (aka reg_lambda) does not. I usually use only one of the two parameters. Unfortunately, the documentation does not provide much detail here.
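As an illustration, a small sketch of trying each penalty on its own (the values are arbitrary examples, not recommendations):

```python
# Sketch: use either the L1 or the L2 penalty, not both.
from lightgbm import LGBMRegressor

l1_only = LGBMRegressor(reg_alpha=0.5, reg_lambda=0.0, random_state=42)  # L1: can zero out weak contributions
l2_only = LGBMRegressor(reg_alpha=0.0, reg_lambda=1.0, random_state=42)  # L2: shrinks them but keeps them
```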
Correct answer by Peter on December 9, 2020
Here's a link to a good answer to the follow-up question of "should you use both L1 and L2 regularization terms?", summarized briefly here:
Answered by Kevin2342 on December 9, 2020