What cost function and penalty are suitable for imbalanced datasets?

Question

For an imbalanced data set, is it better to choose an L1 or L2 regularization?

Is there a cost function more suitable for imbalanced datasets to improve the model score (log_loss in particular)?

Media · Answer

If you have unbalanced data, at first I recommend you try to have real data. I mean do not replicate your data by hand if you don't have balanced data. You should never ever change the distribution of your data. Although it may seem that you reduce the Bayse error, your classifier won't do well in your specified application. I have two suggestions for imbalanced data:

Use class weights to improve your cost function. For the rare class use a much larger value than the dominant class.
Use F1 score to evaluate your classifier

For an imbalanced set of data is it better to choose an L1 or L2 regularization

These are for dealing with over-fitting problem. First of all you have to learn the training data to solve high bias problem. The latter is more common in usual tasks. They are just fine for imbalanced data set but consider the point that first you have to deal with high bias problem, learning the data, then deal with high variance problem, avoiding over-fitting.

David Dale · Answer

So you ask how does class imbalance affect classifier performance under different losses? 
You can make a numeric experiment.

I do binary classification by logistic regression. However, the intuition extends on the broader class of models, in particular, neural networks. I measure performance by cross-validated ROC AUC, because it is insensitive to class imbalance. I use an inner loop of cross validation to find the optimal penalties for L1 and L2 regularization on each dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
import matplotlib.pyplot as plt

cvs_no_reg = []
cvs_lasso = []
cvs_ridge = []
imb = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001]
Cs = [1e-5, 3e-5, 1e-4, 3e-4, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]
for w in imb:
    X, y = make_classification(random_state=1, weights=[w, 1-w], n_samples=10000)
    cvs_no_reg.append(cross_val_score(LogisticRegression(C=1e10), X, y, scoring='roc_auc').mean())
    cvs_ridge.append(cross_val_score(LogisticRegressionCV(Cs=Cs, penalty='l2'), X, y, scoring='roc_auc').mean())
    cvs_lasso.append(cross_val_score(LogisticRegressionCV(Cs=Cs, solver='liblinear', penalty='l1'), X, y, scoring='roc_auc').mean())

plt.plot(imb, cvs_no_reg)
plt.plot(imb, cvs_ridge)
plt.plot(imb, cvs_lasso)
plt.xscale('log')
plt.xlabel('fraction of the rare class')
plt.ylabel('cross-validated ROC AUC')
plt.legend(['no penalty', 'ridge', 'lasso'])
plt.title('Sensitivity to imbalance under different penalties')
plt.show()

You can see that under high imbalance (left-hand side of the picture) L1 regularization performs better than L2, and both better than no regularization.

But if the imbalance is not so serious (the smallest class share is 0.03 and higher), all the 3 models perform equally well.

As for the second question, what is a good loss function for imbalanced datasets, I will answer that log loss is good enough. Its useful property is that it doesn't make your model turn the probability of a rare class to zero, even if it is very very rare.

Leon palafox · Answer

If you have an imbalanced dataset you usually want to make it balanced to begin with, since that will artificially affect your scores.
Now, you want to be measuring precision and recall, since those can capture a bit better the imbalanced dataset biases.
L1 or L2 won't perform particularly better in a balanced or unbalanced dataset, what you want to do is call elastic nets (which is a combination of the two) and do cross validation over the coefficients of each of the regularizers.
Also, doing grid search is very odd, you are better using just cross validation and see what parameters work better.
They even have ElasticNetCV, which does that part for you

What cost function and penalty are suitable for imbalanced datasets?

3 Answers

Add your own answers!

Ask a Question