Binary classification problem on imbalanced data

Question

The literature mostly says that it is generally a good idea to apply some technique to balance the two classes. For a neural network, which matters more here: the imbalance itself (the relative class frequencies), or not having enough positive-class samples to learn from?
E.g., a fraud detection problem. Consider 100k fraudulent examples and 1,900k non-fraudulent examples, i.e., 5% positive-class examples. 100k seems a fair amount of data to learn from. From a conceptual point of view (i.e., before training and observing performance), would it make sense not to balance this data?

Shahriyar Mammadli · Answer

If you do not balance your samples effectively, it will likely hurt your overall results (or performance). If you do not use oversampling, I think poor performance would be due to the imbalance in the dataset rather than to a lack of positive-class samples to learn from. This is because the majority-class samples would dominate your loss function.
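To make the point about the loss function concrete, here is a minimal sketch in plain Python. It assumes a binary cross-entropy loss and a toy batch with the question's 5% positive rate; the function names and the `pos_weight` parameter are hypothetical, chosen for illustration only. It measures what fraction of the total loss comes from positive examples for an untrained model that outputs 0.5 everywhere, with and without a class weight of 19 (the 1,900k / 100k ratio from the question).

```python
import math

def weighted_bce_terms(y_true, p_pred, pos_weight=1.0):
    """Per-example binary cross-entropy; positive terms are scaled by pos_weight."""
    eps = 1e-12
    terms = []
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            terms.append(-pos_weight * math.log(p))
        else:
            terms.append(-math.log(1.0 - p))
    return terms

def positive_share(y_true, p_pred, pos_weight=1.0):
    """Fraction of the total loss contributed by positive-class examples."""
    terms = weighted_bce_terms(y_true, p_pred, pos_weight)
    pos = sum(t for t, y in zip(terms, y_true) if y == 1)
    return pos / sum(terms)

# Toy batch (an assumption): 5 fraud and 95 non-fraud examples, i.e. 5% positives.
y = [1] * 5 + [0] * 95
# An untrained model that predicts 0.5 for every example.
p = [0.5] * 100

unweighted = positive_share(y, p)                  # positives carry 5% of the loss
reweighted = positive_share(y, p, pos_weight=19.0)  # positives carry 50% of the loss
print(unweighted, reweighted)
```

Without any weighting, 95% of the loss in this toy batch comes from the non-fraud class, regardless of how many positive samples exist in absolute terms. Reweighting the positive class is one alternative to oversampling that addresses the same issue.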
However, I think you can try a simpler but powerful approach to modeling fraud detection (or similar anomaly detection problems). A highly underrated but very powerful approach is Gaussian Mixture Models. Based on the holy logic of the Gaussian distribution, non-fraud samples constitute your "normal" data and fraud samples are your outliers. One example system is this.
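A minimal sketch of the Gaussian Mixture idea, assuming scikit-learn and NumPy are available. The data here is synthetic and purely illustrative, and the number of components and the 1% threshold are assumptions that would need tuning on real data. The mixture is fitted on non-fraud samples only, and anything with an unusually low log-likelihood under that model is flagged as an outlier.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for non-fraud data (an assumption): two clusters of
# "normal" behaviour in a 2-D feature space.
normal = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2)),
    rng.normal(loc=[6.0, 6.0], scale=1.0, size=(500, 2)),
])
# Hypothetical fraud samples placed far from both clusters.
fraud = np.array([[15.0, -10.0], [-12.0, 14.0], [20.0, 20.0]])

# Fit the mixture on non-fraud samples only.
gmm = GaussianMixture(n_components=2, random_state=0).fit(normal)

# score_samples returns the log-likelihood of each sample under the model.
# Flag anything below the 1st percentile of the training scores (assumed cutoff).
threshold = np.percentile(gmm.score_samples(normal), 1)

def is_anomaly(X):
    """Return a boolean array: True where a sample looks like an outlier."""
    return gmm.score_samples(X) < threshold

flags = is_anomaly(fraud)
print(flags)
```

In practice the cutoff trades off missed fraud against false alarms, so it is usually chosen on a labeled validation set rather than fixed in advance.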
