Asked by zonna on April 21, 2021
I was training an XGBoost classifier on a heavily imbalanced dataset with a 232:1 binary class ratio. Because my training data contains 750k rows and 320 features (after extensive feature engineering, feature-correlation filtering, and low-variance filtering), I preferred to use scale_pos_weight to deal with the imbalance rather than oversampling. After tuning parameters with Bayesian optimization to maximize PR AUC under 5-fold cross-validation, the best cross-validation scores were:
PR AUC = 4.87%, ROC AUC = 78.5%, Precision = 1.49%, and Recall = 80.4%
When I applied the tuned model to the test dataset, the results were:
accuracy: 0.562
roc_auc: 0.776293
pr_auc: 0.032544
log_loss: 0.706263
F1: 0.713779
Confusion Matrix:
[[9946 7804]
[ 18 84]]
              precision    recall  f1-score   support

           0       1.00      0.56      0.72     17750
           1       0.01      0.82      0.02       102

    accuracy                           0.56     17852
   macro avg       0.50      0.69      0.37     17852
weighted avg       0.99      0.56      0.71     17852
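(Worked out from the confusion matrix above: precision = 84 / (84 + 7804) ≈ 0.011 and recall = 84 / (84 + 18) ≈ 0.824. Reaching 10% precision at this recall would mean cutting false positives from 7804 to at most 9 × 84 = 756.)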
My parameter search space (tuning took 2-3 days for 100 iterations) was:
{'learning_rate': (0.001, 0.2), 'min_split_loss': (0, 20), 'max_depth': (3, 10),
 'min_child_weight': (0, 50), 'max_delta_step': (0, 10), 'subsample': (0.5, 1),
 'colsample_bytree': (0.5, 1), 'colsample_bynode': (0.5, 1), 'colsample_bylevel': (0.5, 1),
 'reg_lambda': (1e-5, 100), 'reg_alpha': (0, 1),
 'objective': 'binary:logistic', 'booster': 'gbtree', 'scale_pos_weight': 232, 'n_estimators': 200}
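The question does not say which Bayesian optimization library was used; purely as a sketch, here is how the same search space could be expressed with scikit-optimize's BayesSearchCV, scoring on average precision (scikit-learn's PR-AUC-style metric):

from skopt import BayesSearchCV
from skopt.space import Real, Integer
from xgboost import XGBClassifier

# Tuned ranges, mirroring the search space above
search_spaces = {
    'learning_rate':     Real(0.001, 0.2),
    'min_split_loss':    Real(0, 20),
    'max_depth':         Integer(3, 10),
    'min_child_weight':  Real(0, 50),
    'max_delta_step':    Real(0, 10),
    'subsample':         Real(0.5, 1),
    'colsample_bytree':  Real(0.5, 1),
    'colsample_bynode':  Real(0.5, 1),
    'colsample_bylevel': Real(0.5, 1),
    'reg_lambda':        Real(1e-5, 100),
    'reg_alpha':         Real(0, 1),
}

# Fixed parameters taken from the question
model = XGBClassifier(
    objective='binary:logistic',
    booster='gbtree',
    scale_pos_weight=232,  # approximate negative:positive ratio
    n_estimators=200,
)

opt = BayesSearchCV(
    model,
    search_spaces,
    n_iter=100,                   # 100 iterations, as in the question
    cv=5,                         # 5-fold cross-validation
    scoring='average_precision',
    n_jobs=-1,
    random_state=42,
)
# opt.fit(X_train, y_train)  # X_train, y_train are placeholders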
Per the business requirement, we prioritize high recall (to save those in the positive class); however, I am frustrated by the very low precision, which drives up the cost of saving the positive class. Is there any way to increase precision to at least 10% without hurting recall?
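One common lever here is the decision threshold: rather than classifying at the default 0.5, sweep thresholds with scikit-learn's precision_recall_curve and read off the best precision attainable at the required recall. A minimal sketch, assuming a fitted model and held-out X_test, y_test (placeholder names):

from sklearn.metrics import precision_recall_curve

# Placeholders: model is a fitted XGBClassifier, X_test/y_test the held-out data
y_score = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_score)

# Highest precision achievable while keeping recall >= 0.80
mask = recall >= 0.80
print(f"best precision at recall >= 0.80: {precision[mask].max():.3f}")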
Given that both the f1-score and PR AUC are very low even for a prevalence of ~0.45%, it cannot be deduced whether the limitations are imposed by the nature of the data or by the model (the features plus the algorithm used).
In order to build a better understanding and to resolve the issue, I would suggest breaking the problem into two parts:

1. Reduce the imbalance, for example by undersampling the majority class to roughly 80:20, both for training and testing, and check whether the features and algorithm can separate the classes in this easier setting. Once you are satisfied with the performance of your approach, move to 2 below.

2. Gradually restore the original 232:1 class distribution and re-tune the model.

Answered by jdsuryap on April 21, 2021
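A minimal sketch of the undersampling in step 1 of the answer, assuming the data lives in a pandas DataFrame df with a binary label column target (placeholder names):

import pandas as pd

# Placeholders: df is the full labelled dataset, 'target' its binary label column
pos = df[df['target'] == 1]
neg = df[df['target'] == 0]

# Downsample negatives to 4 per positive, i.e. an 80:20 class ratio
neg_sampled = neg.sample(n=4 * len(pos), random_state=42)
df_8020 = pd.concat([pos, neg_sampled]).sample(frac=1, random_state=42)  # shuffle rows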