
XGBoost Log Loss different from GridSearchCV Log Loss

Data Science Asked by Sean O'Connor on June 6, 2021

I have a classification problem where I am trying to predict whether the data returns a 1 or a 0, so your classic binary classification. I have split my data into the independent variables (the features I'm training on) and the dependent variable (my target, either a 0 or a 1). I am using log loss as the scoring metric for my model.

First, I use the cv function in xgboost to find the number of estimators I need, since it stops when the log loss hasn't improved for 50 rounds. I then train my model and predict. My code is below:

import xgboost as xgb
from sklearn import metrics

def modelfit(alg, dtrain, dtarget, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        # gets the xgb parameters specifically.
        xgb_param = alg.get_xgb_params()

        # this is xgboost's internal data structure (DMatrix), used for efficiency; it maps the training data to the labels.
        xgtrain = xgb.DMatrix(dtrain.values, label=dtarget)

        # this performs cross-validation on the dataset. As our data is not really time dependent, we can afford to
        # cross-validate. It stops when the log loss hasn't improved for 50 rounds. This is only for determining n_estimators.
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='logloss', early_stopping_rounds=early_stopping_rounds)

        print(f'Optimal n_estimators - {cvresult.shape[0]}')

        # this sets the most optimal n_estimators parameter into the booster.
        alg.set_params(n_estimators=cvresult.shape[0])

    # fit the algorithm on the data and set evaluation metric
    alg.fit(dtrain.values, dtarget, eval_metric='logloss', eval_set=[(dtrain.values, dtarget)])

    print(alg.evals_result())

    # predict training set:
    dtrain_predictions = alg.predict(dtrain.values)
    print(dtrain_predictions)
    dtrain_predprob = alg.predict_proba(dtrain.values)[:,1]

    # print model report:
    print("nModel Report")
    print("Log Loss Score (Train): %f" % metrics.log_loss(dtarget, dtrain_predprob))

I then run this function on this particular XGBClassifier:

# Choose all predictors
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    scale_pos_weight=1,
    nthread=-1,
    seed=27)

modelfit(xgb1, X, y)

The log loss value that is returned is 0.577496 and the number of estimators is 65.
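Note that cvresult already records an out-of-fold log loss at each boosting round. A minimal sketch of reading off the final value, assuming cvresult is made available outside modelfit (e.g. by returning it from the function) and using the test-logloss-mean column that xgb.cv produces for this metric:

# The last row of cvresult corresponds to the chosen number of rounds;
# 'test-logloss-mean' there is the cross-validated log loss, which is the
# number directly comparable to GridSearchCV's best_score_.
print(cvresult['test-logloss-mean'].iloc[-1])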

I then turn to GridSearchCV to tune the other parameters, starting with:

param_test1 = {
 'max_depth' : range(1,10),
 'min_child_weight' : range(1,6)
}

Note how the original max_depth and min_child_weight values that I used in the xgb1 classifier are contained within these ranges.

xgb2 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=65,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=-1,
    scale_pos_weight=1,
    seed=27)

gsearch1 = GridSearchCV(
    estimator=xgb2,
    param_grid=param_test1,
    scoring='neg_log_loss',
    n_jobs=-1,
    cv=5)

gsearch1.fit(X, y)
gsearch1.best_params_, gsearch1.best_score_

However, this returns:

({'max_depth': 1, 'min_child_weight': 1}, -0.6275341839742403)
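For completeness, gsearch1.cv_results_ holds the mean out-of-fold score for every candidate in the grid, so the score the search assigned to any particular combination can be read off directly. A minimal sketch, using sklearn's standard cv_results_ keys:

import pandas as pd

# Mean held-out score for each parameter combination tried by the search.
results = pd.DataFrame(gsearch1.cv_results_)
mask = (results['param_max_depth'] == 5) & (results['param_min_child_weight'] == 1)
print(results.loc[mask, ['params', 'mean_test_score', 'std_test_score']])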

So my question is: how can the grid search say the best parameters are max_depth = 1 and min_child_weight = 1 with a log loss of 0.628, when before using GridSearchCV my model returned a better log loss of 0.577 with max_depth = 5 and min_child_weight = 1?

Any help would be appreciated. Thanks!

One Answer

Your modelfit prints the training score, but GridSearchCV bases its decisions on the out-of-fold average (in particular, best_score_ is an out-of-fold average score). This is an unfair comparison, so your 0.577 is probably quite optimistically biased.
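For a like-for-like number, score the first model the same way the grid search scores its candidates, i.e. on held-out folds. A minimal sketch, assuming xgb1, X, and y from the question:

from sklearn.model_selection import cross_val_score

# Out-of-fold log loss for the original settings, computed the same way
# GridSearchCV computes best_score_ (sklearn negates log loss, hence the sign flip).
scores = cross_val_score(xgb1, X, y, scoring='neg_log_loss', cv=5)
print(-scores.mean())  # directly comparable to -gsearch1.best_score_

If this comes out close to 0.628 rather than 0.577, the gap was train/test optimism, not anything GridSearchCV did wrong.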

Answered by Ben Reiniger on June 6, 2021
