
Bad classification performance of logistic regression on imbalanced data in testing as compared to training

Data Science Asked on August 2, 2021

I am trying to fit a logistic regression model to a highly imbalanced dataset (0.5%/99.5%) with high dimensionality (about 15k features). I used a random forest to select the top 200 most important features. There are around 120K observations.

When I fit a logistic regression model on the balanced dataset (using SMOTE for oversampling), F1, recall, and precision are good on training. But on testing, precision and F1 are bad. I assume this makes sense, because in training there was a much larger share of the minority class, while in reality/testing there is only a very small percentage. So the algorithm is still looking for more minority cases, which causes a high false positive rate.
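For reference, here is a minimal sketch of this kind of setup, assuming scikit-learn and imbalanced-learn; the dataset shape and parameters are illustrative stand-ins for the real data. The key point is that SMOTE is applied only to the training folds, so validation scores reflect the true class balance:

```python
# Minimal sketch, assuming scikit-learn and imbalanced-learn; shapes/params illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # resamples only the training folds

# Stand-in for the real data: ~120K rows, 200 selected features, 0.5% positives
X, y = make_classification(n_samples=120_000, n_features=200,
                           weights=[0.995], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# F1 is estimated on untouched (non-oversampled) validation folds
print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())
```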

What kinds of methods could I try to improve the performance?

I am currently trying different sampling methods for the imbalanced dataset, and I also plan to try PCA.

4 Answers

I suspect the reason is that the class balance in your test set is different from the class balance in your training set. That will throw everything off. The fundamental assumption made by statistical machine learning methods (including logistic regression) is that the distribution of data in the test set matches the distribution of data in the training set. SMOTE can throw that off.

It sounds like you have used SMOTE to augment the training set by adding additional synthetic positive instances (i.e., oversampling the minority class) -- but you haven't added any negative instances. So, the class balance in the training set might have shifted from 0.5%/99.5% to something like (say) 10%/90%, while the class balance in the test set remains 0.5%/99.5%. That's bad; it will cause the classifier to over-predict positive instances. For some classifiers, it's not a major problem, but I expect that logistic regression might be more sensitive to this mismatch between training distribution and test distribution.

Here are two candidate solutions for the problem that you can try:

  1. Stop using SMOTE. Ensure the training set has the same distribution as the test set. SMOTE might actually be unnecessary in your situation.

  2. Continue to augment the training set using SMOTE as you're currently doing, and compensate for the train/test mismatch by shifting the threshold for classification. Logistic regression produces an estimated probability that a particular instance is from the positive class. Typically, you then compare that probability to the threshold $0.5$ and use that to classify it as positive or negative. You can adjust the threshold to correct for the mismatch: oversampling inflates the predicted odds of the positive class by roughly a factor of $k$, where $k$ is the ratio of the positive rate in your training set after augmentation to the positive rate before, so raise the threshold from $0.5$ to roughly $k/(k+1)$ (e.g., if augmentation shifted the training set from 0.5%/99.5% to 10%/90%, then $k = 10/0.5 = 20$ and the threshold becomes about $20/21 \approx 0.95$). Alternatively, use cross-validation to find a threshold that maximizes the F1 score (or some other metric). A sketch of this adjustment follows the list.
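Here is a hedged sketch of the threshold shift in option 2, assuming a scikit-learn LogisticRegression called `model` has already been fitted on the SMOTE-augmented training set; `model`, `X_test`, and the example rates are illustrative:

```python
# Sketch of the threshold shift; `model`, `X_test`, and the rates are illustrative.
pi_true = 0.005    # positive rate in the real/test distribution (0.5%)
pi_train = 0.10    # positive rate in the augmented training set (example)

# Oversampling inflates the predicted odds of the positive class by this factor
factor = (pi_train / pi_true) * ((1 - pi_true) / (1 - pi_train))  # ~22 here
threshold = factor / (factor + 1)                                 # ~0.96 instead of 0.5

proba = model.predict_proba(X_test)[:, 1]   # estimated P(y = 1 | x)
y_pred = (proba >= threshold).astype(int)
```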

Incidentally, I recommend you make sure to use regularization with your logistic regression model, and use cross-validation to select the regularization hyper-parameter. There's nothing wrong with 15K features if you have 120K instances in your training set, but you might want to regularize it strongly (choose a large regularization parameter) to avoid overfitting.
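A minimal sketch of that advice, assuming scikit-learn (the `X_train`/`y_train` names and the grid are illustrative); note that scikit-learn's `C` is the inverse of the regularization strength, so strong regularization means a small `C`:

```python
# Regularization strength selected by cross-validation; the grid is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

clf = LogisticRegressionCV(Cs=np.logspace(-4, 2, 10),  # small C = strong penalty
                           cv=5, penalty="l2", scoring="f1", max_iter=1000)
clf.fit(X_train, y_train)
print(clf.C_)  # selected inverse regularization strength
```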

Finally, understand that dealing with class imbalance as severe as yours is just hard. Fortunately, there are many techniques available. Do some reading and research (including on Stats.SE) and you should be able to find other methods to try if these don't work well enough.

Correct answer by D.W. on August 2, 2021

The dimensionality of your data is an important consideration here. Having 15K features will likely lead to very poor results. The higher the dimensionality of your features, the more training examples you need. For a shallow method such as logistic regression, a general rule of thumb is to have at least $10 \times \#\text{features}$ training examples. So unless you have over 150K examples, using 15K features is not recommended. Think about what kinds of questions need to be answered in your data and how you can remodel your data to better answer those questions.

Furthermore, logistic regression is not recommended for skewed datasets. There are many algorithms well suited to this kind of problem. In particular, anomaly detection algorithms can learn the distribution of a single class (the event not occurring) and then flag an anomaly (the event occurring) when an instance falls sufficiently far outside the learned distribution. You can use this to estimate the probability of the event occurring via a p-value-style test, comparing an instance's position in your feature space against the learned distribution.

The simplest method would be a generalized likelihood ratio test (GLRT), but I think you will most likely have more luck with a k-NN based method for skewed datasets.
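As a hedged illustration of the k-NN idea, one simple anomaly score is the average distance to the nearest points of the majority class, assuming scikit-learn; the data names, neighbour count, and cutoff are illustrative and should be tuned:

```python
# k-NN anomaly score sketch; neighbour count and cutoff are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_train[y_train == 0])        # learn only the "event not occurring" class

dist, _ = nn.kneighbors(X_test)      # distances to the 5 nearest negatives
score = dist.mean(axis=1)            # larger = further from the learned distribution

# Flag the most distant instances as anomalies (cutoff should be validated)
y_pred = (score > np.percentile(score, 99.5)).astype(int)
```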

Answered by JahKnows on August 2, 2021

I have done the same thing, and it is a dangerous approach.

The DANGER is that we do feature selection with a non-linear model (Random Forest) and then apply a linear model (Logistic Regression): features the forest selects for their non-linear or interaction effects may carry little signal a linear model can use.

Alternatives:

- Try a tree-based algorithm, OR
- Use PCA, which is linear, and test Logistic Regression again.
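For the second alternative, here is a minimal sketch assuming scikit-learn; the data names and the number of components are illustrative and should be tuned:

```python
# PCA (linear) followed by logistic regression; n_components is illustrative.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(),            # PCA is scale-sensitive
                     PCA(n_components=50),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
```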

Answered by FrancoSwiss on August 2, 2021

You face three problems and here are my recommendations:

1. unbalanced classes
Logistic regression (unlike some other methods) is very well capable of handling imbalanced classes per se: the intercept (bias) weight shifts all the predictions toward the correct base rate. But this comes with some caveats, mentioned in the paper below.

2. different class distribution in train/test data
First of all, it is a warning sign that you have different kinds of data in your train and test sets. If your training data does not represent your test (or, more importantly, prediction) situation well, even the best model will not generalize and may make poor predictions. That said, you can change (or have a different) class distribution in the training set and still obtain unbiased predictions by making small modifications to the predictions (or the model). For more, see King, G., Zeng, L. (2001) ‘Logistic Regression in Rare Events Data’, Political Analysis, 9, pp. 137–163.
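A sketch of the "prior correction" from King and Zeng, assuming a fitted scikit-learn LogisticRegression called `model`; `tau` (the true population positive rate) and the training-data names are illustrative:

```python
# King & Zeng prior correction: shift the intercept to match the true base rate.
import numpy as np

tau = 0.005              # true positive rate in the population (0.5%)
y_bar = y_train.mean()   # positive rate in the (re)sampled training set

model.intercept_ -= np.log(((1 - tau) / tau) * (y_bar / (1 - y_bar)))
```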

3. potential overfitting
You should introduce regularization (L1/L2, a.k.a. lasso/ridge) and conduct a grid search to find the optimal hyperparameters. I prefer to let the optimization algorithm itself find the most important features with respect to explanatory power: an L1 penalty drives the coefficients of uninformative features to zero. You should only use unsupervised dimensionality reduction (like PCA) if you really need to simplify the optimization problem.
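A minimal grid-search sketch for point 3, assuming scikit-learn; the data names and grid are illustrative, and the `saga` solver is chosen because it supports both penalties. With `penalty="l1"`, coefficients driven to zero indicate features the optimizer has effectively dropped:

```python
# Grid search over penalty type and strength; the grid is illustrative.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=5000),  # saga handles l1 and l2
    param_grid={"penalty": ["l1", "l2"], "C": [0.001, 0.01, 0.1, 1, 10]},
    scoring="f1", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```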

Answered by Antalagor on August 2, 2021
