
Why does Gradient Boosting regression predict negative values when there are no negative y-values in my training set?

Asked by user2592989 on Data Science, December 19, 2020

As I increase the number of trees in scikit-learn's GradientBoostingRegressor, I get more negative predictions, even though there are no negative values in my training or testing set. I have about 10 features, most of which are binary.

Some of the parameters that I was tuning were:

  • the number of trees/iterations;
  • tree depth (max_depth);
  • and learning rate.

The percentage of negative values seemed to peak at about 2%. A tree depth of 1 (stumps) produced the largest percentage of negative values, and the percentage also seemed to increase with more trees and a smaller learning rate. The dataset is from one of the Kaggle playground competitions.

My code is something like:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y)

# 8000 depth-1 trees (stumps), least-squares loss, small learning rate
reg = GradientBoostingRegressor(n_estimators=8000, max_depth=1, loss='ls', learning_rate=0.01)
reg.fit(X_train, y_train)
ypred = reg.predict(X_test)
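To put a number on it, the share of negative predictions can be checked with something like this (a small continuation of the snippet above, using the same ypred and y_train):

import numpy as np

# fraction of negative predictions on the held-out set
print(f"{np.mean(ypred < 0):.2%} of test predictions are negative")
print("min training target:", y_train.min())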

3 Answers

In general, regression models (of any kind) can behave arbitrarily outside the domain spanned by the training samples. In particular, they are free to assume the modeled function is linear, so if, for instance, you train a regression model on the points:

X     Y
10    0
20    1
30    2

it is reasonable for it to learn the model f(x) = x/10 - 1, which returns negative values for x < 10.
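A minimal sketch of this point (using scikit-learn's LinearRegression purely for illustration, not the boosting setup from the question):

import numpy as np
from sklearn.linear_model import LinearRegression

# the three training points from the table above
X = np.array([[10], [20], [30]])
y = np.array([0, 1, 2])

model = LinearRegression().fit(X, y)
print(model.predict([[0]]))  # ~[-1.], negative although every training y is >= 0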

The same applies "in between" your data points: because of the family of functions a particular method can model, it is always possible to get predictions outside the range of your training targets.

You can think about it another way: what is so special about negative values? Why does the existence of negative predictions seem strange (when none appear in the training set), while a prediction of, say, 2131.23 does not alarm you even if that exact value never appears either? Unless it is explicitly designed to, no model will treat negative values differently from positive ones. They are simply part of the real line and can be predicted like any other value.

Answered by lejlot on December 19, 2020

Remember that the GradientBoostingRegressor (assuming a squared-error loss function) successively fits regression trees to the residuals of the previous stage. If the model at stage i predicts a value larger than the target for a particular training example, the residual at stage i for that example is negative, so the regression tree at stage i+1 is fit to negative target values (the residuals from stage i). Because the boosting algorithm sums all of these trees to make the final prediction, I believe this can explain why you may end up with negative predictions even though all the target values in the training set were positive, especially since you mention it happens more often as you increase the number of trees.
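A rough way to watch this happen (a sketch reusing X_train, y_train, and X_test from the question; staged_predict yields the cumulative prediction after each boosting stage):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor(n_estimators=8000, max_depth=1, learning_rate=0.01)
reg.fit(X_train, y_train)

# fraction of negative test predictions after every 1000 stages
for i, pred in enumerate(reg.staged_predict(X_test), start=1):
    if i % 1000 == 0:
        print(i, f"{np.mean(pred < 0):.2%} negative")

Using staged_predict avoids refitting a separate model for every number of trees, so you can see at which stage the negative predictions start to accumulate.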

Answered by Milad Shahidi on December 19, 2020

The default number of estimators is 100. Reducing the number of estimators may work.

Answered by Orchid Chetia Phukan on December 19, 2020
