TransWikia.com

High Recall but very low Precision on imbalanced data

Data Science Asked by zonna on April 21, 2021

I was training a model using the XGBoost classifier on a heavily imbalanced dataset with a 232:1 binary class ratio. Because my training data contains 750k rows and 320 features (after extensive feature engineering, feature-correlation filtering, and low-variance filtering), I preferred to use scale_pos_weight to handle the imbalance rather than oversampling the data. After tuning parameters with Bayesian optimization to maximize PR AUC under 5-fold cross-validation, the best cross-validation scores were:
PR AUC = 4.87%, ROC AUC = 78.5%, Precision = 1.49%, and Recall = 80.4%
When I applied the tuned model to a test dataset, the results were:

accuracy: 0.562
roc_auc: 0.776293
pr_auc: 0.032544
log_loss: 0.706263
F1: 0.713779
Confusion Matrix:    
[[9946 7804]
 [  18   84]]
          precision    recall  f1-score   support

       0       1.00      0.56      0.72     17750
       1       0.01      0.82      0.02       102

    accuracy                           0.56     17852
   macro avg       0.50      0.69      0.37     17852
weighted avg       0.99      0.56      0.71     17852
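(Editor's note: the test-set numbers above come from scikit-learn's standard metric functions. A minimal, self-contained sketch of how they are computed is below; the arrays `y_test` and `y_prob` are dummy placeholders, not the asker's data.)

```python
import numpy as np
from sklearn.metrics import (average_precision_score, roc_auc_score,
                             confusion_matrix, classification_report)

# Dummy stand-ins for the real test labels and model outputs.
y_test = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.6, 0.7, 0.3, 0.2, 0.1, 0.4, 0.8, 0.9])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 decision cutoff

# Ranking metrics use the probabilities, not the hard predictions.
print("roc_auc:", roc_auc_score(y_test, y_prob))
print("pr_auc :", average_precision_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=2))
```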

The parameter ranges I optimized (which took 2-3 days for 100 iterations) were:

{'learning_rate': (0.001, 0.2),
 'min_split_loss': (0, 20),
 'max_depth': (3, 10),
 'min_child_weight': (0, 50),
 'max_delta_step': (0, 10),
 'subsample': (0.5, 1),
 'colsample_bytree': (0.5, 1),
 'colsample_bynode': (0.5, 1),
 'colsample_bylevel': (0.5, 1),
 'reg_lambda': (1e-5, 100),
 'reg_alpha': (0, 1),
 'objective': 'binary:logistic',
 'booster': 'gbtree',
 'scale_pos_weight': 232,
 'n_estimators': 200}

According to the business requirements, we weight recall more heavily (to save those in the positive class), but I am frustrated by the very low precision, which drives up the cost of saving the positive class. Is there any way to raise precision to at least 10% without hurting recall?
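(Editor's note: one standard lever for this precision/recall trade-off, separate from retraining, is tuning the decision threshold on the precision-recall curve rather than using the default 0.5 cutoff. A minimal sketch, where `y_val` and `p_val` are synthetic placeholders for held-out labels and predicted probabilities:)

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic placeholder validation data: ~5% positives, overlapping scores.
rng = np.random.default_rng(0)
y_val = (rng.random(1000) < 0.05).astype(int)
p_val = np.clip(0.3 * y_val + rng.random(1000) * 0.7, 0, 1)

precision, recall, thresholds = precision_recall_curve(y_val, p_val)
# precision/recall have len(thresholds) + 1 entries; drop the final point.
ok = recall[:-1] >= 0.80                      # enforce the recall floor
best = np.argmax(precision[:-1] * ok)         # best precision among allowed points
print("threshold:", thresholds[best],
      "precision:", precision[best], "recall:", recall[best])
```

This only re-positions the model along its existing precision-recall curve; if the whole curve is poor (as the PR AUC of ~3% suggests), the model itself needs to improve.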

One Answer

Given that both the F1-score and PR AUC are very low even relative to the prevalence of ~0.45%, it cannot be deduced whether the limitation comes from the nature of the data or from the model (the features plus the algorithm used).

To build a better understanding and resolve the issue, I would suggest breaking the problem into two parts:

  1. Build a model that works for the selected features. For this purpose, try creating a less imbalanced dataset (say, 80:20) for both training and testing. Once you are satisfied with the performance of your approach, move to step 2.
  2. Use your original imbalanced dataset and see whether the situation improves. If not, it is now clearer that the issue lies in the imbalanced nature of the data, and you should try the standard techniques for dealing with imbalanced classes. I hope this helps, because otherwise the only options at hand are synthesizing or collecting more data for the minority class.
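(Editor's note: step 1 above can be sketched with simple majority-class undersampling to an 80:20 ratio; `X` and `y` below are synthetic placeholders for the real data, and imbalanced-learn's `RandomUnderSampler` would do the same job.)

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder data: 10,000 rows, ~0.4% positives (stand-in for the real set).
X = rng.random((10_000, 5))
y = (rng.random(10_000) < 0.004).astype(int)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep all positives; undersample negatives to a 80:20 majority:minority ratio.
keep_neg = rng.choice(neg_idx, size=4 * len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, keep_neg])
rng.shuffle(keep)

X_bal, y_bal = X[keep], y[keep]
print("positive fraction:", y_bal.mean())  # 0.20 by construction
```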

Answered by jdsuryap on April 21, 2021
