Does gradient boosting algorithm error always decrease faster and lower on training data?

Question

I am building another XGBoost model and I'm really trying not to overfit the data. I split my data into train and test set and fit the model with early stopping based on the test-set error which results in the following loss plot:

I'd say this is pretty standard plot with boosting algorithms as XGBoost. My reasoning is that my point of interest is mostly the performance on the test set and until the XGBoost stopped training around 600th epoch due to early stopping the test-set loss was still decreasing. On the other way, overfitting is sometimes defined as a situation when train error decreases faster than test error and this is exactly what happens here. But my intuition is that decision-tree based techniques always drill down the training data (for example, trees in random forest kind of deliberately overfit the training set and it's a role of bagging to reduce the variance). I believe gradient boosting techniques are also known from drilling down the training dataset pretty deep and even with low learning rate we can't help it.
But I may be wrong. Therefore, I'd like to confirm that this situation is not really something to worry about and I do not overfit the data. I'd also like to ask what the perfect learning curves plot would look like with gradient boosting techniques?

N. Kiefer · Answer

You should worry about overfitting when the test error rate starts to go up again. Until then I would set it aside. Overfitting is rather about the number of parameters, e.g. When two models with the same performance have different number of parameters you would prefer the one with less parameters to preserve the generalisation power of the model.
Generally the training error will be lower than the test error for obvious reasons (the model specifically sees the training data but not the test data), but even this can be reversed in some cases. For example when there is very little test data, or the test data has a different distribution than the training data.
A "perfect" learning curve doesn't exist, the final performance is what counts.

Carlos Mougan · Answer

I'd like to confirm that this situation is not really something to worry about and I do not overfit the data.

No, the situation is not worrying, you can consider it worrying when the test error starts increasing. Normally an optimal model is a bit overfitted.

decision-tree based techniques always drill down the training data

Yes, your intuition is right. The whole goal of bagging in Random Forest is to avoid the overfitting of an individual decision tree.

I believe gradient boosting techniques are also known from drilling down the training dataset pretty deep and even with low learning rate we can't help it.

A smaller learning rate will lead even more to overfitting after enough iterations. Same intuition as gradient descent.

XGBoost stopped training around 600th epoch due to early stopping

I am not sure if 'epoch' is the right term here (I might be wrong), but could be more iterations, as it is referring to the number of trees in the boosting ensemble.

The figure

What metric are you plotting? Are you sure it is a representative metric? What happens if you plot the rest of the binary classification metrics?

Does gradient boosting algorithm error always decrease faster and lower on training data?

2 Answers

Add your own answers!

Ask a Question