
Cost sensitive learning and class balancing

Data Science question, asked by A1010 on September 5, 2021

I am facing a classification problem with classes that are really imbalanced (more or less 1% of positive cases). In addition, the "cost" of a False Negative (FN) is much higher than the cost of a False Positive (FP).

Considering this, I decided to add weights to my classification model. What is the best way to compute these weights? Ideally, the weights should take into account both the class imbalance and the misclassification cost.

I am not interested in assigning a 0/1 label to each record, but only in ranking the test dataset by the output score. The idea is to contact the records with the highest scores to offer them a product.

I would like to apply these weights via, for example, the sample_weight parameter available in most scikit-learn classification algorithms (see the documentation).

Is it a good idea to oversample/downsample the data and then use the weights only to control the misclassification cost? Or would it be better to use weights that account for both at once? Are there any known ways to compute such weights?
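For concreteness, here is a minimal sketch (not from the question) of one way to combine an imbalance weight and a cost weight into scikit-learn's sample_weight; the cost figures, the synthetic data, and the choice of GradientBoostingClassifier are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy data with ~1% positives, mimicking the imbalance described in the question.
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.99, 0.01],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Assumed business costs (hypothetical numbers): a missed positive (FN) is 10x
# worse than a wasted contact (FP).
cost_fn, cost_fp = 10.0, 1.0

# Imbalance part: weight each class inversely to its frequency.
n = len(y_train)
n_pos = y_train.sum()
n_neg = n - n_pos
w_imbalance = {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

# Cost part: errors on the positive class cost `cost_fn`, on the negative class `cost_fp`.
w_cost = {0: cost_fp, 1: cost_fn}

# Combined per-sample weight = imbalance weight * cost weight for that sample's class.
sample_weight = np.array([w_imbalance[c] * w_cost[c] for c in y_train])

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)

# The question only needs a ranking, so sort records by the positive-class score.
scores = clf.predict_proba(X_test)[:, 1]
ranking = np.argsort(scores)[::-1]   # indices of test records, best prospects first
```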

2 Answers

It is not clear whether you are saying the cost of a FN or of a FP is higher; you only mention the FN in your statement. I will assume that a FN is more costly and that a positive means a label of 1.

In general, if an incorrect prediction for the minority case is more costly (a FN), you should sample the minority case more heavily or the majority case less so that the ratio is closer to 1:1. Balancing will help increase the accuracy of your model when predicting the minority case; accuracy in predicting the majority case will already be higher, as there are more samples to learn from for that case. Undersampling, oversampling and SMOTE are all useful ways to accomplish this balancing of samples, and each has its own strengths and weaknesses.

However, doing this sample balancing will quickly increase the number of FP, so even though the cost of a single FP may be lower, those costs add up quickly. For example, for every 1 FN that you eliminate, you may gain 10 or 20 more FP.

After doing this balancing, you can start to adjust the weights to get the best ratio of FN to FP, trying to get the total cost as low as possible.

minimizing: total cost = FN x cost_of_fn + FP x cost_of_fp

I am not sure there is a closed-form equation to solve this, but you can do it iteratively: change the weight ratio for the two classes, calculate the total cost using the confusion matrix to get FN and FP, and graph cost (y) vs. weight ratio (x), looking for a minimum. I would start with a ratio equal to the ratio of your costs.

Example: If the cost of a FN is 10 dollars and the cost of a FP is 1 dollar, then the ratio should be 10:1 for minority:majority class
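As an illustration of this iterative search, a minimal sketch that sweeps the class-weight ratio and reads FN/FP off the confusion matrix; the logistic regression, the synthetic data, and the cost values are assumptions, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the real problem.
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.99, 0.01],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

cost_fn, cost_fp = 10.0, 1.0            # hypothetical per-error costs
ratios = [1, 2, 5, 10, 20, 50, 100]     # positive-class weight relative to negative

results = []
for r in ratios:
    clf = LogisticRegression(class_weight={0: 1.0, 1: float(r)}, max_iter=1000)
    clf.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    total_cost = fn * cost_fn + fp * cost_fp
    results.append((r, total_cost))
    print(f"weight ratio {r:>3}:1  FN={fn:>4}  FP={fp:>5}  total cost={total_cost:,.0f}")

best_ratio = min(results, key=lambda t: t[1])[0]
print("lowest total cost at weight ratio:", best_ratio)
```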

A straightforward calculation shows how to derive the desired mix of class i, μ(i), in terms of the calculated weight w(i) and the number of examples of class i in the training set (before any up- or down-sampling), n(i). The formula is given below:

[formula image from the referenced article, giving μ(i) in terms of w(i) and n(i)]

Reference: https://medium.com/rv-data/how-to-do-cost-sensitive-learning-61848bf4f5e7
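The formula image itself is not reproduced above, but a natural reading of the description is that the desired mix is the weight-times-count product normalized over classes, i.e. μ(i) = w(i)·n(i) / Σⱼ w(j)·n(j). The snippet below encodes that assumption and should be checked against the linked article.

```python
# Assumed reconstruction: mu(i) = w(i) * n(i) / sum_j w(j) * n(j).
# w: per-class weights, n: per-class counts before any resampling.
def desired_mix(w, n):
    totals = {c: w[c] * n[c] for c in n}
    z = sum(totals.values())
    return {c: totals[c] / z for c in totals}

# Example: 1% positives with a 10:1 weight on the positive class.
print(desired_mix(w={0: 1.0, 1: 10.0}, n={0: 9900, 1: 100}))
# -> {0: ~0.908, 1: ~0.092}: positives should make up roughly 9% of the resampled set.
```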

Correct answer by Donald S on September 5, 2021

Supporting Donald's answer, I would also look at the predicted probabilities (via the predict_proba method of scikit-learn classifiers) and customize your predictions by selecting the threshold (which you can also check on the corresponding ROC curve) that gives you the most robust (i.e. highest) predicted probability distributions, so your classifier is "sure" about what it is predicting. Something like this: [image: predicted probability distributions and confusion matrices for two classifiers]. Here, the classifier on the right seems to give more robust predictions, although the confusion matrix on the left looks a little bit better.

You can also use metrics like recall, which gives a less optimistic and more realistic picture (recall quantifies the number of correct positive predictions made out of all the positive predictions that could have been made; source: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/).
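As a sketch of this suggestion (with a placeholder classifier, synthetic data, and Youden's J chosen arbitrarily as the threshold criterion), one could pick the threshold from the ROC curve and report recall at it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced data, as in the earlier sketches.
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.99, 0.01],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Predicted probability of the positive class for every test record
# (this is also the score you would sort on to build the contact list).
proba = clf.predict_proba(X_test)[:, 1]

# Pick a threshold from the ROC curve; Youden's J (TPR - FPR) is used here as one
# possible criterion, not necessarily the one the answer has in mind.
fpr, tpr, thresholds = roc_curve(y_test, proba)
threshold = thresholds[np.argmax(tpr - fpr)]

y_pred = (proba >= threshold).astype(int)
print("chosen threshold:", threshold)
print("recall at that threshold:", recall_score(y_test, y_pred))
```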

Answered by German C M on September 5, 2021
