
Confusion Matrix before and after SMOTE is same

Data Science Asked on June 28, 2021

I am working with a very unbalanced dataset and I applied SMOTE (to the training data only). However, I do not understand why the results before and after SMOTE are the same.

The attached confusion matrix is the same before and after SMOTE.

Is that normal? And what is the interpretation of this matrix?

[Image: confusion matrix]

Edit:
This is my classification report; as you can see, one of the classes is ignored entirely by the model:

[Image: classification report]

3 Answers

There is a lot to explain about the "why", but you basically do not provide the code or describe what you did.

Also, based on your confusion matrix, I assume your model is doing exactly what it was trained to do: you get good performance from the model you fit, setting aside that the minority classes are ignored.

Based on my experience, I would use class/sample weights in the model instead of creating synthetic observations, especially when dealing with rare events. Your model is classifying exactly as you told it to: it is not penalized for misclassifying the minority classes, so it focuses on the metric you trained with, accuracy.
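A minimal sketch of the class-weight alternative, using scikit-learn's `class_weight="balanced"` (the data here is synthetic; misclassifying a rare-class point now costs more, so the model cannot "win" by predicting only the majority label):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~95% / ~5%
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights each class by
# n_samples / (n_classes * class_count)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```

Most classifiers expose an equivalent knob (`class_weight`, `scale_pos_weight`, or a `sample_weight` argument to `fit`), so no synthetic observations are needed.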

Note: confusion matrix interpretation (the model seems to treat all labels as if they were on an "equal" footing):

  • Your model correctly classifies ~98.6% of the observations whose true label is "Benign" as "Benign". The other classes are missed: observations whose true labels are the other classes in your list are also predicted as "Benign".

Hope it helps.

Answered by n1tk on June 28, 2021

From your confusion matrix, your model only predicts the Benign class. It seems that you have a degenerate model, which means there is a problem somewhere: either your model hasn't learned at all, or it has learned too well against a poorly chosen metric. This doesn't really appear to be linked to SMOTE; it seems to be about your calibration process. We can't really help you more than that without further details.

Answered by lcrmorin on June 28, 2021

Your model is probably overfitted now because you are adding too much synthetic data, probably "repeating" some of it. For this kind of dataset SMOTE alone is not enough; remember the techniques available for fighting imbalance, as one usually needs to combine several of them.

Try this pipeline:

1. Apply Tomek links to remove as many majority-class points as you can from the class boundaries.
2. Check whether you can randomly undersample some (not tons) of the majority-class data.
3. Perform oversampling (like SMOTE) with caution, so as not to overfit the model again; don't get obsessed with a perfectly balanced dataset, generate part of it, not all.
4. Use focal loss (depending on the algorithm), based on the class proportions.

This pipeline should help.

Answered by Eduardo Di Santi Grönros on June 28, 2021

