
Should I oversample my validation data to get better F1 score and PRC?

Data Science, asked by Frank Xu on January 22, 2021

I am currently working with an imbalanced dataset of about 30k rows and 14 features, in which 99.5% of the rows are labeled 0. Since the data is so strongly imbalanced, I decided to use precision, recall, and F1 score to judge the performance of the model.

I used SMOTE to oversample my training data (after splitting off the validation set). My model is now trained on the oversampled training data, and I am going to test it on the validation set. If I validate it on the original validation data, I get an F1 score of around 0.05, and the classification report is as follows:

          precision    recall  f1-score   support

 Class 0       1.00      0.86      0.93      7606
 Class 1       0.03      0.75      0.05        36

If I oversample my validation data as well, I get an F1 score of around 0.85:

          precision    recall  f1-score   support

 Class 0       0.84      0.86      0.85      7606
 Class 1       0.86      0.83      0.85      7606
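
For reference, the workflow I am describing looks roughly like this (a minimal sketch assuming imbalanced-learn's SMOTE and scikit-learn; the classifier, split sizes, and the synthetic data stand-in are just placeholders for my real setup):

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report, average_precision_score

    # Stand-in for the real data: ~30k rows, 14 features, ~99.5% labeled 0
    X, y = make_classification(n_samples=30000, n_features=14,
                               weights=[0.995], random_state=42)

    # Split first, so the validation set is never touched by SMOTE
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)

    # Oversample the minority class in the training split only
    X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    model = DecisionTreeClassifier(random_state=42)  # placeholder classifier
    model.fit(X_train_res, y_train_res)

    # Evaluate against the original, imbalanced validation distribution
    print(classification_report(y_val, model.predict(X_val)))
    print("PR AUC:", average_precision_score(y_val, model.predict_proba(X_val)[:, 1]))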

My questions are:

  1. Should I use an oversampled validation set? (The results look much better, but I think the underlying model is the same either way.)

  2. Why do I get such bad metrics on the original validation data? Is it because the dataset is not big enough?

One Answer

What you are encountering are real-world problems that are rarely taught in classes.

  • For training, I would try scikit-learn's class_weight="balanced", or explicit weights roughly inverse to the class frequencies, e.g. class_weight={0: 0.005, 1: 0.995}. It's a very robust technique (see the sketch after this list).
  • For testing, you can't fiddle with the class weights. The validation set is meant to simulate real-world data.
  • Make sure you don't overfit. For a decision tree, for example, limit max_leaf_nodes to 5-10 or (not both) max_depth to 3-5.
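
A minimal sketch of that idea with a shallow tree (the estimator and the exact values are placeholders, not a prescription; X_train and y_train stand for your original, non-oversampled training split):

    from sklearn.tree import DecisionTreeClassifier

    # class_weight="balanced" derives the weights from the class frequencies,
    # so the rare class 1 gets a much larger weight and misclassifying it
    # costs correspondingly more during training.
    clf = DecisionTreeClassifier(class_weight="balanced", max_depth=4, random_state=42)

    # The same idea with explicit weights roughly inverse to the class shares:
    # clf = DecisionTreeClassifier(class_weight={0: 0.005, 1: 0.995}, max_depth=4)

    clf.fit(X_train, y_train)  # fit on the original training data, no SMOTE needed here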

Your results aren't that bad. A class-1 precision of 0.03 is low, but you only had 36 positive labels. Roughly speaking, the model flagged about 900 rows as 1 (0.75 * 36 / 0.03 ≈ 900, call it 1,000), i.e. about 97% of the alarms are false. But with a good recall, you caught most of the real 1's.

Putting it differently: in a validation set of around 7,600 rows, about 1,000 were labeled as 1. That's around 13%. Out of those 1,000, roughly 27 are real 1's, which is most of the 36 actual positives (a recall of about 75%).
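
As a rough back-of-the-envelope check, those counts follow directly from the numbers in your report:

    support_1   = 36    # actual 1's in the validation set (from the report)
    recall_1    = 0.75  # recall for class 1
    precision_1 = 0.03  # precision for class 1

    true_positives      = recall_1 * support_1          # ~27 real 1's that were caught
    predicted_positives = true_positives / precision_1  # ~900 rows flagged as 1
    print(round(true_positives), round(predicted_positives))
    # and 1 - precision_1, i.e. ~97% of the flagged rows, are false alarms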

Imagine this was churn prediction. You don't want to lose customers. Without ML, you might send all 8,000 customers a 10% discount. That's expensive. With ML, you would only send a discount to about 1,000 customers and still reach most of those who are about to leave. That's a strong improvement over having no model.

The same argument applies to many other cases such as Predictive Maintenance.

Answered by FrancoSwiss on January 22, 2021
