Data Science Asked by Vikrant Arora on December 9, 2020
I want to know how L1 & L2 regularization works in Light GBM and how to interpret the feature importances.
The scenario is: I used LGBMRegressor with RandomizedSearchCV (cv=3, n_iter=50) on a dataset of 400,000 observations and 160 variables. In order to avoid overfitting / regularize the model, I provided the ranges below for the alpha/L1 and lambda/L2 parameters, and the best values according to the randomized search are reg_lambda=1 and reg_alpha=0.5 (a sketch of this setup follows the parameter lists below).
‘reg_lambda’: [0.5, 1, 3, 5, 10]
‘reg_alpha’: [0.5, 1, 3, 5, 10]
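For reference, here is a minimal sketch of such a search setup. Everything outside the two parameter lists (the random_state, the commented-out data names X_train/y_train) is a placeholder assumption, and the real search presumably tuned more parameters than shown:

```python
# Minimal sketch of the randomized search described above.
# Only the reg_lambda / reg_alpha lists come from the post; the rest is illustrative.
from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "reg_lambda": [0.5, 1, 3, 5, 10],  # L2 penalty
    "reg_alpha": [0.5, 1, 3, 5, 10],   # L1 penalty
}

search = RandomizedSearchCV(
    estimator=LGBMRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,   # with only these two lists there are 25 combinations, so sklearn will cap this
    cv=3,
    random_state=42,
)
# search.fit(X_train, y_train)   # X_train / y_train: the 400k x 160 dataset
# print(search.best_params_)     # e.g. {'reg_lambda': 1, 'reg_alpha': 0.5}
```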
Now my question: the feature importance values with the optimized reg_lambda=1 and reg_alpha=0.5 are very different from those obtained without providing any input for reg_lambda and reg_alpha. The regularized model considers only the top 5-6 features important and makes the importance values of the other features as good as zero (refer to the images). Is that normal behaviour of L1/L2 regularization in LightGBM?
To further explain the LGBM output with L1/L2: the top 5 important features are the same in both cases (with/without regularization); however, the importance values after the top 2 features have been shrunk significantly by the L1/L2 regularized model, and beyond the top 5 features the regularized model makes the importance values as good as zero (refer to the images of feature importance values for both cases).
Another related question I have: how should I interpret the importance values, and when I run the LGBM model with the best parameters from the randomized search, do I need to remove the features with low importance values and then rerun the model? Or should I run it with all the features, so that the LGBM algorithm (with L1 & L2 regularization) takes care of low-importance features and gives them no weight, or perhaps only a minute weight, when it makes predictions?
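To make the comparison concrete, here is a hedged sketch of how the two importance lists could be put side by side. The data names (X_train, y_train) and the parameter values are assumptions for illustration, not taken from the actual experiment:

```python
# Sketch: compare feature importances with and without L1/L2 regularization.
# X_train / y_train are assumed to be a pandas DataFrame / Series.
import pandas as pd
from lightgbm import LGBMRegressor

model_plain = LGBMRegressor(random_state=42)
model_reg = LGBMRegressor(reg_alpha=0.5, reg_lambda=1, random_state=42)
# model_plain.fit(X_train, y_train)
# model_reg.fit(X_train, y_train)

# feature_importances_ reports split counts by default;
# pass importance_type="gain" to the constructor for gain-based importance.
# importances = pd.DataFrame({
#     "feature": X_train.columns,
#     "no_regularization": model_plain.feature_importances_,
#     "with_regularization": model_reg.feature_importances_,
# }).sort_values("with_regularization", ascending=False)
# print(importances.head(10))
```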
Any help will be highly appreciated.
Regards
Vikrant
With regularization, LightGBM "shrinks" features which are not "helpful". So it is in fact normal that feature importance is quite different with and without regularization. You don't need to exclude any features, since the purpose of shrinking is to use features according to their importance (this happens automatically).
In your case the top two features seem to have good explanatory power, so that they are used as "most important" features. Other features are less important and are therefore "shrunken" by the model.
You may also find that different features pop up at the top of the list (and that the list looks different in general) when you run the model multiple times. This is because, if you don't fix a seed, the model will take different paths to obtain a best fit, so the whole procedure is not deterministic.
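For example, a minimal sketch of pinning the seed (the value 42 is arbitrary):

```python
# Sketch: fix the seed so repeated runs give the same importance ranking.
from lightgbm import LGBMRegressor

model = LGBMRegressor(
    random_state=42,        # fixes LightGBM's internal randomness
    # deterministic=True,   # optional in recent LightGBM versions; fully deterministic at some speed cost
)
```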
Overall you should get a better fit with regularization (otherwise there is little need for it).
I wonder if it makes sense to use both (L1 and L2)? L1 (aka reg_alpha) can shrink feature weights to zero while L2 (aka reg_lambda) does not. I usually use only one of the two parameters. Unfortunately, the documentation does not provide much detail here.
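As an illustration, a small sketch of trying each penalty on its own (the values are arbitrary examples, not recommendations):

```python
# Sketch: use either the L1 or the L2 penalty, not both.
from lightgbm import LGBMRegressor

l1_only = LGBMRegressor(reg_alpha=0.5, reg_lambda=0.0, random_state=42)  # L1: can zero out weak contributions
l2_only = LGBMRegressor(reg_alpha=0.0, reg_lambda=1.0, random_state=42)  # L2: shrinks them but keeps them
```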
Correct answer by Peter on December 9, 2020
Here's a link to a good answer to the follow-up question of "should you use both L1 and L2 regularization terms?", summarized briefly here:
Answered by Kevin2342 on December 9, 2020