Data Science Asked by RafalQA on January 6, 2021
I am using XGBoost for classification. My target y is 0 or 1 (true or false). I have categorical and numeric features, so theoretically, I need to use SMOTE-NC instead of SMOTE. However, I get better results with SMOTE.
Could anyone explain why this is happening?
Also, if I use an encoder (BinaryEncoder, one-hot, etc.) for the categorical data, do I need to apply SMOTE-NC before or after encoding?
Here is my example code (x and y are the data after cleaning, including BinaryEncoder).
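import joblib
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from xgboost import XGBClassifier

# Hold out 20% for validation; only the training split is resampled.
X_train, X_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=1)
smt = SMOTE()
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)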
params_model1 = {
    'booster': ['dart', 'gbtree', 'gblinear'],
    'learning_rate': [0.001, 0.01, 0.05, 0.1],
    'min_child_weight': [1, 5, 10, 15, 20],
    'gamma': [0, 0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'max_delta_step': [0, 1, 2, 3, 5, 10],
    'base_score': [0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65],
    'reg_alpha': [0, 0.5, 1, 1.5, 2],
    'reg_lambda': [0, 0.5, 1, 1.5, 2],
    'n_estimators': [100, 200, 500]
}
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1001)
# Base estimator; the random search below overrides the hyperparameters listed in params_model1.
xgb = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                    colsample_bynode=1, colsample_bytree=0.3, gamma=1,
                    learning_rate=0.1, max_delta_step=0, max_depth=10,
                    min_child_weight=5, n_estimators=1000, n_jobs=1,
                    objective='binary:logistic', random_state=0,
                    reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                    subsample=0.8, verbosity=1)
scoring = 'f1'
rs_xgb = RandomizedSearchCV(xgb, param_distributions=params_model1, n_iter=1,
                            scoring=scoring, n_jobs=4, cv=skf.split(X_resampled, y_resampled), verbose=3,
                            random_state=1001)
rs_xgb.fit(X_resampled, y_resampled)
refit = rs_xgb.best_estimator_
joblib.dump(refit, 'validator1.pkl')
loaded_xgb = joblib.load('validator1.pkl')
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
y_predict = loaded_xgb.predict(X_val.to_numpy())
print(confusion_matrix(y_val, y_predict))
print("Final result " + str(f1_score(y_val, y_predict)))
You have to keep in mind that machine learning is still largely an empirical field, full of ad hoc approaches that happen to work well in most cases but lack a theoretical explanation of why they do so.
SMOTE arguably falls under this category; there is absolutely no guarantee (theoretical or otherwise) that SMOTE-NC will work better for your data than SMOTE, or even that SMOTE will perform better than much simpler approaches, like plain random oversampling or undersampling. Quoting from section 6.1 on SMOTE-NC of the original SMOTE paper (emphasis added):
"SMOTE-NC with the Adult dataset differs from our typical result: it performs worse than plain under-sampling based on AUC. [...] even SMOTE with only continuous features applied to the Adult dataset, does not achieve any better performance than plain under-sampling."
The authors go on to offer some possible explanations for this atypical performance of SMOTE/SMOTE-NC on that dataset, but as you will see, these rest on a close look at the dataset itself and its characteristics; the reasoning is itself rather empirical in nature and hardly "theoretical".
Bottom line: there is not much to explain here. Any further detail would require digging into the specific characteristics of your dataset, which is of course not possible here. I would suggest not worrying about it and continuing to be guided by your experimental results rather than by any (practically non-existent) theory on the subject.
Correct answer by desertnaut on January 6, 2021