Data Science Asked by Nick Bohl on April 18, 2021
I’m experiencing an issue with a RandomizedSearchCV grid that is not able to evaluate all of the fits. 50 of the 100 fits I’m calling do not get scored (score=nan), so I’m worried I’m wasting a lot of time running the grid search. I’ve spent the past few days trying to troubleshoot this without finding anything, and I’m hopeful the community can help me squash this bug. Now, the details:
I have constructed an XGBClassifier model as such:
import xgboost as xgb

xgb_clf = xgb.XGBClassifier(tree_method="exact", predictor="cpu_predictor", verbosity=1,
                            objective="binary:logistic", scale_pos_weight=1.64)
# my trainingset is imbalanced 85k majority class, 53k minority class
Currently, I am attempting to use the hashing trick to encode my categorical variables, as they are all nominal. I do this after splitting my training set into X and y variables
import category_encoders as ce

ce_hash = ce.HashingEncoder()
hashed_new = ce_hash.fit_transform(X)
hashed_X = hashed_new
I then conduct my train_test_split as normal and instantiate a RandomizedSearchCV with a parameter grid; the code is as follows:
from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = tts(hashed_X, y, test_size=.25)
# create my classifier
xgb_clf = xgb.XGBClassifier(tree_method="exact", predictor="cpu_predictor", verbosity=1,
objective="binary:logistic", scale_pos_weight= 4)
# Create parameter grid
params = {"learning_rate": [0.2, 0.1, 0.01, 0.001],
"gamma" : [10, 12, 14, 16],
"max_depth": [2, 4, 7, 10, 13],
"colsample_bytree": [ 0.8, 1.0, 1.2, 1.4],
"subsample": [0.8, 0.85, 0.9, 0.95, 1, 1.1],
"eta": [0.05, 0.1, .2, ],
"reg_alpha": [1.5, 2, 2.5, 3],
"reg_lambda": [0.5, 1, 1.5, 2],
"min_child_weight": [1, 3, 5, 7],
"n_estimators": [100, 250, 500]}
from sklearn.model_selection import RandomizedSearchCV
# Create RandomizedSearchCV Object
xgb_rscv = RandomizedSearchCV(xgb_clf, param_distributions=params, scoring='precision',
cv=10, verbose=3)
# Fit the model: ten sampled parameter settings, each cross-validated on ten folds, for 100 individual fits.
model_xgboost = xgb_rscv.fit(X_train, y_train)
However, for 50 of the 100 fits, I get output that looks like this:
[CV] subsample=0.8, reg_lambda=2, reg_alpha=3, n_estimators=100, min_child_weight=3, max_depth=10, learning_rate=0.001, gamma=16, eta=0.1, colsample_bytree=1.4, **score=nan**, total= 0.1s
When this occurs, it occurs in sections of ten, so 10 straight fits will all generate a score of nan. The 50 nan scores don’t always occur in the same order, but there are always 50 that don’t get scored correctly.
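To narrow down which parameter combinations are failing, I can dump the search’s cv_results_ into a DataFrame after fitting and look at the rows whose test score is NaN (a small sketch, assuming the model_xgboost object from above and pandas imported as pd):

import pandas as pd

# One row per sampled parameter set, aggregated over the 10 folds
results = pd.DataFrame(model_xgboost.cv_results_)

# Parameter sets whose mean test score came back as NaN
failed = results.loc[results["mean_test_score"].isna(), "params"]
print(failed.tolist())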
Would anyone know how I can attempt to correct this and ensure that all 100 fits get scored? Is this happening because I’m using a hashed feature set?
Thanks!
Some of your hyperparameter values aren't allowed (colsample_bytree and subsample cannot be more than 1), so probably xgboost errors out and sklearn helpfully moves on to the next point, recording the score as NaN. Half of your values for colsample_bytree are disallowed, which supports seeing half of your scores as NaN; and that will happen regardless of the fold, which explains why you always see them in groups of 10.
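If you'd rather have the failure surface immediately instead of as a silent NaN, sklearn's searches accept error_score="raise", which re-raises the underlying xgboost error. A minimal sketch using the objects defined in the question, with the out-of-range values simply removed from the grid:

# Keep only values in (0, 1]; xgboost rejects sampling ratios above 1
params["colsample_bytree"] = [0.8, 1.0]
params["subsample"] = [0.8, 0.85, 0.9, 0.95, 1]

# error_score="raise" makes a failing fit raise the underlying error
# instead of being scored as NaN, so misconfigurations fail loudly
xgb_rscv = RandomizedSearchCV(xgb_clf, param_distributions=params, scoring='precision',
                              cv=10, verbose=3, error_score="raise")
model_xgboost = xgb_rscv.fit(X_train, y_train)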
Correct answer by Ben Reiniger on April 18, 2021