Negative values in XGBoost regression

Data Science Asked by Aastha Jha on October 28, 2020

I am trying to perform regression using XGBoost. My dataset has all positive values but some of the predictions are negative.

I read on this link that reducing the number of trees might help the situation.

I reduced the estimators from 700 to 570 and the number of negative predictions decreased but is there any way to remove these negative predictions? When I further tried reducing the estimators to 400, I got worse results with high rmse and more negative predictions.

I looked at the negative predictions to understand where the model went wrong(see below image):
enter image description here

It looks like if I take the absolute value of the negative predictions then that would be sufficient but is this the right method?

My code is:

!pip install xgboost
!pip install scikit-optimize
from xgboost import XGBRegressor

final_XGB=XGBRegressor(random_state=123,gamma= 24.47 ,learning_rate=0.1235,max_depth=10,min_child_weight=0.21509999999999999,

from sklearn.model_selection import cross_validate

cross_val_scores=cross_validate(final_XGB,X_train,y_train,cv=3,scoring=['neg_mean_squared_error','r2'],verbose=1,return_train_score=True,n_jobs=-1 )

cross_val_scores['test_r2'].mean() #returns 0.9595609470775659


A bit more about the dataset. I am trying to predict the number of people that would be present at a given location at a given time period.

In order to predict the count of people, I have taken the

  • hour of day,
  • date,
  • peak(is it a public holiday?),
  • Weekend(1/0),
  • Area (area of location in m2)
  • location name.

The location names were string values (eg: Location1, Location2, etc.) so I transformed them using JamesSteinEncoder.

There are no missing values and I have removed a few outliers from the dataset.

Regarding the relationship between "Actual" and "Error" column, I have plotted the graph below:

enter image description here

Snapshot of final results:

enter image description here

One Answer

Gradient boosting machines can return values outside of the training range. Have a look at this post Can Boosted Trees predict below the minimum value of the training label?

In practice this is unlikely to happen, but it can be the case for your data.

If this is happening probably what it means is that your training data and the one you are evaluating are different.

Answered by Carlos Mougan on October 28, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP