Relatively high regularization parameters for XGBoost model only way to prevent overfitting

Data Science Asked by blacksite on December 23, 2020

I am modeling a continuous regression/forecasting problem for very right-skewed data. I’ve been using ElasticNet and Huber regression with quite a bit of success, and have recently moved into using XGBoost to see if it’ll provide any additional value. The dimensions of my training matrix are 60,000 rows by 500 columns.

What I’ve found is that the much simpler, more interpretable ElasticNet/Huber regression models very often outperform any XGBoost model I’ve built. The only way I can get XGBoost to compete is by using a ton of different forms of regularization. In particular: the most performant XGBoost models have had reg_alpha/reg_lambda parameters in the [10, 150] range; gamma in the [25, 100] range; subsample of 0.5; colsample_bytree of 0.5; and shallow max_depth values, e.g. 3/4/5, with around 150 n_estimators.
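
For concreteness, here is a minimal sketch of that setup using XGBoost's scikit-learn wrapper. The specific values are illustrative picks from the ranges above, not tuned results, and X_train/y_train stand in for the 60,000 x 500 training data:

    import xgboost as xgb

    # Heavily regularized configuration, roughly matching the ranges described above.
    model = xgb.XGBRegressor(
        n_estimators=150,      # ~150 boosting rounds
        max_depth=4,           # shallow trees (3/4/5)
        reg_alpha=50,          # L1 penalty on leaf weights, somewhere in [10, 150]
        reg_lambda=50,         # L2 penalty on leaf weights, somewhere in [10, 150]
        gamma=50,              # minimum loss reduction to make a split, in [25, 100]
        subsample=0.5,         # sample half the rows for each tree
        colsample_bytree=0.5,  # sample half the columns for each tree
    )
    # model.fit(X_train, y_train)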

From what I’ve gathered in various tutorials online, gamma values over 10 or 20 are considered very high, although I fully acknowledge that this could depend heavily on the characteristics of the dataset being used.

For this super-regularized model, the predictions and feature importances make sense from an intuitive perspective.

I guess I’m just looking for some input: is it unreasonable that I have such high regularization parameters, or are these high values more justified than they first appear, since the proof seems to be in the pudding with the model’s predictive power/generalizability and sensible feature importances?

One Answer

I support your "proof is in the pudding" sentiment.

Some of those hyperparameters are not that extreme, in my experience. Boosted trees very often perform best with weak individual learners; your max_depth is right in line with what I'm used to seeing as best. The score regularization penalties (alpha, lambda) don't play as important a role in my experience, but I'm used to seeing optimal parameters chosen in the high double-digits. Your subsampling and column subsetting rates also seem reasonable, if on the lower end of what I've generally seen as being optimal. Your gamma is quite high, but that doesn't mean something is wrong; perhaps if you shrink the max depth a bit you could relax the gamma regularization, but I don't think that's in any way necessary.
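
If you want to test the depth-versus-gamma trade-off I mentioned, one sketch is a small grid search that pairs shallower trees with weaker gamma. The grid values here are illustrative, and X_train/y_train are placeholders for your data:

    import xgboost as xgb
    from sklearn.model_selection import GridSearchCV

    # Search depth and gamma jointly to see whether shallower trees
    # let you relax the gamma regularization.
    param_grid = {
        "max_depth": [2, 3, 4],
        "gamma": [5, 25, 50, 100],
    }
    search = GridSearchCV(
        xgb.XGBRegressor(n_estimators=150, subsample=0.5, colsample_bytree=0.5),
        param_grid,
        cv=5,
        scoring="neg_mean_absolute_error",
    )
    # search.fit(X_train, y_train); print(search.best_params_)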

One possible explanation for this situation: your data is relatively linear and without interactions, so that xgboost doesn't get its main benefits. And your data is noisy enough that, lacking those nonlinear trends, xgboost ends up fitting to noise readily unless you strongly regularize it.
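
One way to probe that explanation is to compare a linear model and a lightly regularized XGBoost on the same cross-validation folds; if the linear model keeps pace, the "mostly linear signal plus noise" story is plausible. A self-contained sketch on synthetic stand-in data (swap in your own X and y):

    import xgboost as xgb
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in: a mostly linear signal with heavy noise.
    X, y = make_regression(n_samples=2000, n_features=50, noise=50.0,
                           random_state=0)

    for name, est in [("elasticnet", ElasticNet(alpha=1.0)),
                      ("xgboost", xgb.XGBRegressor(n_estimators=150,
                                                   max_depth=4))]:
        scores = cross_val_score(est, X, y, cv=5,
                                 scoring="neg_mean_absolute_error")
        print(f"{name}: MAE {-scores.mean():.1f} (+/- {scores.std():.1f})")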

Answered by Ben Reiniger on December 23, 2020
