
Metric for label imbalance

Data Science Asked on May 28, 2021

I’m looking for a metric that can be used to quantify how imbalanced the labels are in a dataset.

I’m not looking for a strategy to solve the imbalance problem; I just want to present how imbalanced my dataset is. I’ve computed the ratio of the most frequent label to the least frequent label, which is probably an OK way of doing it, but I’m sure there’s a more robust way?

3 Answers

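You are looking for entropy. The lower the entropy, the more imbalanced the labels are: entropy is highest when all classes are equally frequent and drops to zero when a single class holds every sample. You can use this function for calculating it.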
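A minimal sketch of an entropy calculation on a list of labels, not necessarily the function the answer refers to. The name `label_entropy` and the `normalize` flag are illustrative assumptions. Dividing by log(k), where k is the number of observed classes, puts the result on a 0 to 1 scale, so that 1 means perfectly balanced and values near 0 mean one class dominates.

```python
import math
from collections import Counter

def label_entropy(labels, normalize=True):
    """Shannon entropy of the label distribution (natural log).

    With normalize=True the value is divided by log(k), where k is the
    number of distinct labels observed, so the result lies in [0, 1].
    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    k = len(counts)
    # H = -sum(p_i * log(p_i)) over the observed classes
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    if normalize:
        # With fewer than two classes log(k) is 0; return 0.0 by convention.
        return h / math.log(k) if k > 1 else 0.0
    return h

balanced = ["a", "b"] * 5          # 5 vs 5
skewed = ["a"] * 9 + ["b"]         # 9 vs 1
print(label_entropy(balanced), label_entropy(skewed))
```

Normalizing makes the value comparable across datasets with different numbers of classes, which the raw entropy is not.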

Correct answer by Abhishek Verma on May 28, 2021

A very simple measure of imbalance would be the standard deviation of the class proportions.

  • Since it's based on proportions, one can compare the imbalance between different datasets.
  • It takes all the classes into account, so with many classes it gives a different value depending on whether there are many small and many large classes (higher imbalance overall) or only one outlier class (lower imbalance overall).
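The measure described above can be sketched as follows. The function name `proportion_std` is hypothetical, and using the population standard deviation (rather than the sample one) is an assumption, since the answer does not say which is meant. Only classes that actually occur in the labels are counted.

```python
import statistics
from collections import Counter

def proportion_std(labels):
    """Population standard deviation of the class proportions.

    Returns 0.0 when every observed class has the same share of samples.
    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    proportions = [c / n for c in counts.values()]
    return statistics.pstdev(proportions)

balanced = ["a", "b"] * 5          # proportions 0.5 and 0.5
skewed = ["a"] * 9 + ["b"]         # proportions 0.9 and 0.1
print(proportion_std(balanced), proportion_std(skewed))
```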

Answered by Erwan on May 28, 2021

I'd recommend looking at the Gini index as a measure of the inequality in the class sizes. Unlike entropy or standard deviation, the Gini index is explicitly designed to capture the amount of inequality in a distribution.
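A rough sketch, assuming the answer means the Gini coefficient from economics (the Lorenz-curve inequality measure) applied to class sizes, and not the Gini impurity used in decision trees. The name `gini_coefficient` is illustrative. The code uses the standard sorted-rank formula and expects a non-empty list of labels.

```python
from collections import Counter

def gini_coefficient(labels):
    """Gini coefficient of the class sizes.

    0.0 means all observed classes are the same size; for k classes the
    value approaches (k - 1) / k as one class absorbs nearly all samples.
    Hypothetical helper for illustration only; labels must be non-empty.
    """
    sizes = sorted(Counter(labels).values())   # ascending class sizes
    k = len(sizes)
    total = sum(sizes)
    # G = 2 * sum(rank_i * x_i) / (k * sum(x)) - (k + 1) / k, ranks from 1
    weighted = sum(rank * size for rank, size in enumerate(sizes, start=1))
    return 2 * weighted / (k * total) - (k + 1) / k

balanced = ["a", "b"] * 5          # sizes 5 and 5
skewed = ["a"] * 9 + ["b"]         # sizes 9 and 1
print(gini_coefficient(balanced), gini_coefficient(skewed))
```

Because the upper bound depends on the number of classes, values are easiest to compare between datasets with the same number of classes.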

Answered by kfx on May 28, 2021
