Metric for label imbalance

Question

I'm looking for a metric that can be used to quantify how imbalanced the labels are in a dataset.
I'm not looking for a strategy to solve the imbalance problem, I just want to present how imbalanced my dataset is. I've computed the ratio of the most frequent and least frequent labels which is probably an ok way of doing it but I'm sure there's a more robust way?

Abhishek Verma · Accepted Answer

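You are looking for entropy. The lower the Shannon entropy of the label distribution, the more imbalanced the dataset is; entropy is highest when all classes are equally frequent. It can be computed directly from the class proportions p_i as H = -Σ p_i log p_i, and dividing by log k (for k classes) scales it to [0, 1], where 1 means perfectly balanced.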
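
The function from the original answer did not survive here, so the following is only a minimal sketch of the idea, not the answerer's code. The name `label_balance` is hypothetical, and it assumes `labels` is a non-empty list and that only classes actually observed in the data count towards k.

```python
import math
from collections import Counter

def label_balance(labels):
    """Normalized Shannon entropy of the labels: 1.0 = balanced, near 0 = imbalanced."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)  # assumption: only observed classes are counted
    if k <= 1:
        # Convention chosen for this sketch: a single class gives 0.0
        # (log k would be 0, so the ratio is otherwise undefined).
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)

balanced = label_balance(["a", "b", "a", "b"])
skewed = label_balance(["a", "a", "a", "a", "b"])
```

Here `balanced` comes out at 1 (up to floating-point error) and `skewed` is strictly smaller, so a lower value signals more imbalance.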

Erwan · Answer

A very simple measure of imbalance would be the standard deviation of the class proportions.

Since it's based on proportions, the imbalance can be compared across datasets of different sizes.
It takes all the classes into account, so with many classes the value differs depending on whether there are many small and many large classes (higher imbalance overall) or only a single outlier class (lower imbalance overall).
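
As a rough sketch of this measure, using only the standard library: the name `proportion_std` is hypothetical, the population standard deviation is an assumption (the answer does not say which variant is meant), and `labels` is assumed to be non-empty.

```python
from collections import Counter
from statistics import pstdev

def proportion_std(labels):
    """Population standard deviation of the class proportions (0.0 = balanced)."""
    counts = Counter(labels)
    n = len(labels)
    proportions = [c / n for c in counts.values()]
    return pstdev(proportions)

even = proportion_std(["a", "a", "b", "b"])    # proportions 0.5 and 0.5
uneven = proportion_std(["a", "a", "a", "b"])  # proportions 0.75 and 0.25
```

A perfectly balanced dataset gives 0, and the value grows as the proportions spread out. Note that the mean proportion is 1/k, so values from datasets with different numbers of classes are probably not on quite the same scale.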

kfx · Answer

I'd recommend looking at the Gini index as a measure of the inequality in the class sizes. Unlike entropy or standard deviation, the Gini index is explicitly designed to capture the amount of inequality in a distribution.
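
A minimal sketch of that calculation follows. It assumes the answer means the Gini coefficient used for inequality (not the Gini impurity from decision trees) and applies the standard sorted-values formula to the class sizes; the name `gini_coefficient` is hypothetical and `labels` is assumed to be non-empty.

```python
from collections import Counter

def gini_coefficient(labels):
    """Gini coefficient of the class sizes: 0.0 = all classes equally large."""
    sizes = sorted(Counter(labels).values())  # ascending class sizes
    k = len(sizes)
    total = sum(sizes)
    # Rank-weighted sum over the sorted sizes (ranks start at 1).
    weighted = sum(rank * size for rank, size in enumerate(sizes, start=1))
    return 2 * weighted / (k * total) - (k + 1) / k

equal = gini_coefficient(["a", "a", "b", "b"])    # sizes [2, 2]
unequal = gini_coefficient(["a", "a", "a", "b"])  # sizes [1, 3]
```

With k classes the result lies between 0 (equal class sizes) and just under (k - 1) / k, so comparing datasets with very different numbers of classes may call for rescaling.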

Answered by kfx on May 28, 2021
