
Gini impurity in decision tree (reasons to use it)

Data Science Asked by Gregwar on February 28, 2021

In a decision tree, Gini impurity[1] is a metric that estimates how mixed the classes in a node are. It measures the probability of mislabelling an item if we assign it a class drawn at random from the node's own class distribution:

$$
I_g(p) = 1 - \sum_{i=1}^J p_i^2
$$

If a node contains 80% of class C1 and 20% of class C2, labelling at random then yields a Gini impurity of 1 - 0.8 × 0.8 - 0.2 × 0.2 = 0.32.
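
For concreteness, here is a minimal Python sketch of that calculation (the function name is mine, not from the question):

```python
def gini_impurity(probs):
    """Gini impurity: probability of mislabelling an item when its label
    is drawn at random from the node's own class distribution."""
    return 1.0 - sum(p * p for p in probs)

# Node with 80% of class C1 and 20% of class C2
print(gini_impurity([0.8, 0.2]))  # ~0.32
```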

However, randomly assigning a class using this distribution seems like a poor strategy compared with simply assigning the most represented class in the node (in the above example, you would label everything C1 and get only 20% error instead of 32%).

In that case, I would be tempted to simply use the probability of mislabelling under that strategy as the metric:

$$
I_m(p) = 1 - \max_i \, p_i
$$

Is there a deeper reason to use Gini, and/or a good reason not to use this approach instead? (In other words, doesn't Gini over-estimate the mislabellings that will actually happen?)
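
To make the comparison explicit, here is a small sketch of the proposed metric next to Gini on the same 80/20 node (my own illustration, not part of the question):

```python
def gini_impurity(probs):
    return 1.0 - sum(p * p for p in probs)

def misclassification_impurity(probs):
    """Probability of error when always predicting the majority class."""
    return 1.0 - max(probs)

probs = [0.8, 0.2]
print(gini_impurity(probs))               # ~0.32
print(misclassification_impurity(probs))  # ~0.2 -- lower, as argued above
```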

[1] https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity


EDIT: Motivation

Suppose you have two classes $C_1$ and $C_2$, with probabilities $p_1$ and $p_2$ ($1 \ge p_1 \ge 0.5 \ge p_2 \ge 0$, $p_1 + p_2 = 1$).

You want to compare the strategy "always label $C_1$" with the strategy "label $C_1$ with probability $p_1$ and $C_2$ with probability $p_2$"; the probabilities of success are respectively $p_1$ and $p_1^2 + p_2^2$.

We can rewrite this second one to:

$$
p_1^2 + p_2^2 = p_1^2 + 2p_1p_2 - 2p_1p_2 + p_2^2 = (p_1 + p_2)^2 - 2p_1p_2 = 1 - 2p_1p_2
$$

Thus, if we subtract it from $p_1$:

$$
p_1 - 1 + 2p_1p_2 = 2p_1p_2 - p_2 = p_2(2p_1 - 1)
$$

Since $p_1 \ge 0.5$, then $p_2(2p_1 - 1) \ge 0$, and thus:

$$
p_1 \ge p_1^2 + p_2^2
$$

So choosing the class with the highest probability is always at least as good a choice.
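
A quick numerical sanity check of this inequality (my own sketch, not part of the original argument):

```python
# Check p1 >= p1^2 + p2^2 for p1 on a grid in [0.5, 1], with p2 = 1 - p1
for i in range(51):
    p1 = 0.5 + i / 100.0
    p2 = 1.0 - p1
    assert p1 >= p1**2 + p2**2 - 1e-12  # equality at p1 = 0.5 and p1 = 1
print("inequality holds on the whole grid")
```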


EDIT: Choosing an attribute

Suppose now we have $n_1$ items in $C_1$ and $n_2$ items in $C_2$. We have to choose which attribute $a \in A$ is the best to split the node on. If we write $n^v$ for the number of items that have value $v$ for a given attribute (and $n^v_1$ for the items of $C_1$ that have value $v$), I propose we use the score:

$$
\sum_v \frac{n^v}{n_1 + n_2} \, \frac{\max(n^v_1, n^v_2)}{n^v_1 + n^v_2}
$$

as a criterion instead of Gini.

Note that since $n^v = n^v_1 + n^v_2$ and $n_1 + n_2$ doesn't depend on the chosen attribute, this can be rewritten as:

$$
\sum_v \max(n^v_1, n^v_2)
$$

And simply interpreted as the number of items in the dataset that will be properly classified.
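
A minimal sketch of this proposed criterion for a two-class problem (the attribute values and counts below are hypothetical, chosen only for illustration):

```python
def majority_score(value_counts):
    """Proposed criterion: the number of items classified correctly if each
    child node predicts its majority class.
    value_counts maps attribute value -> (items of C1, items of C2)."""
    return sum(max(n1, n2) for n1, n2 in value_counts.values())

# Hypothetical attribute with three values
counts = {"a": (8, 2), "b": (3, 7), "c": (5, 5)}
print(majority_score(counts))  # 8 + 7 + 5 = 20 correctly classified items
```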

2 Answers

Since I only had a gut feeling about this, I decided to implement it in code, following the method described at https://victorzhou.com/blog/gini-impurity/. Generally, calculating the GI gives you a metric for which you don't have to know the underlying distribution, which I think is the reason why your example works.

Generally, my conclusion is "Yes, there are situations in which a majority count is more useful, but usually the GI produces a better separation".

Here is what I've done, following the link mentioned above:

  1. I have generated 4 x 2000 data points with random x/y coordinates.
  2. Based on whether the result of np.sqrt((data.p_x*data.p_y)) was bigger or smaller than a cutoff, I have assigned either blue or red as colors.
  3. Then, for each set of 2000 points, I have drawn 100 vertical lines representing potential splits a decision tree node could evaluate.
  4. Then, for each of these splits, I followed the procedure described in https://victorzhou.com/blog/gini-impurity/#example-1-the-whole-dataset. So I basically evaluated the Gini Impurity for both sides of the split, then calculated a weighted sum of these per split and calculated the "Gini Gain" by subtracting that from the naive probability of being right.
  5. In the lower-center plot, I have plotted the gain for each of these splits against the ratio of correct points this split would be able to classify. Also, for each set I have added a horizontal line indicating the number of correctly classified points if I were just to go for majority voting.

I simplified the real world by only using vertical splits, shown in light grey. Also, I'm basically approaching a 2D problem in 1D. The horizontal lines correspond to the accuracy of the majority-vote principle you proposed.
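
For reference, here is a condensed sketch of the Gini-gain computation for a single vertical split, in the spirit of step 4 above (parent impurity minus the weighted impurity of the two sides, as in the linked blog post). This is my own simplified re-statement, not the notebook code, and the variable names are mine:

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(x, labels, threshold):
    """Gini gain of a vertical split at x = threshold."""
    left, right = labels[x < threshold], labels[x >= threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # degenerate split, no gain
    w_left = len(left) / len(labels)
    weighted = w_left * gini(left) + (1 - w_left) * gini(right)
    return gini(labels) - weighted

# Example: 2000 random 2D points, coloured by whether sqrt(x*y) exceeds a cutoff
rng = np.random.default_rng(0)
pts = rng.random((2000, 2))
labels = (np.sqrt(pts[:, 0] * pts[:, 1]) > 0.5).astype(int)
print(gini_gain(pts[:, 0], labels, threshold=0.5))
```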

I uploaded the code here: https://github.com/dazim/stackoverflow_answers/blob/main/gini-impurity-in-decision-tree-reasons-to-use-it.ipynb

Edit 2021-02-19:

Based on the discussion in the comments, I replaced the GI with the approach that Gregwar proposed. Using the two approaches, I evaluated the performance of a node making the decision based on the best cut; plotted is score_gregwar - score_GI. Sadly, I've got to work on other stuff, but maybe someone else wants to pick up on that. Based on what I see in this simulation, the differences don't seem that big. However, using more "real" data and not just drawing vertical splits might change this. Sorry!

Answered by ttreis on February 28, 2021

Misclassification error will not help in splitting the tree.
Reason: we consider the weighted decrease in error from the parent node to the child nodes, and with misclassification error this decrease will always be 0 (other than for pure splits).

Let's consider an example:
Data = 1, 1, 0, 1, 0, 1, 0, 1, 0, 1
Parent classification error = 4/10 = 0.4
Parent Gini impurity = 1 - (0.4 × 0.4 + 0.6 × 0.6) = 0.48

Case I
Split: 1, 1, 0, 1 vs 1, 0, 1, 0, 1, 0
Classification error = 0.4 × 0.25 + 0.6 × 0.5 = 0.4, no decrease, so the split is not chosen
Gini impurity = 0.45

Case II
Split: 1, 1, 1, 0, 0 vs 0, 1, 0, 1, 0
Classification error = 0.5 × 0.4 + 0.5 × 0.4 = 0.4, no decrease, so the split is not chosen
Gini impurity = 0.48

Case III
Split: 1, 1, 0, 0, 1, 1 vs 1, 0, 1, 0
Classification error = 0.6 × (2/6) + 0.4 × 0.5 = 0.4, no decrease, so the split is not chosen
Gini impurity ≈ 0.467

Pure split
Split: 1, 1, 1, 1, 1, 1 vs 0, 0, 0, 0
Classification error = 0, the split is chosen, and no further splits are needed
Gini impurity = 0
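
The numbers above are easy to reproduce; here is a small sketch (my own code, using the same data and the same weighting):

```python
def class_error(labels):
    """Misclassification error when predicting the majority class."""
    ones = sum(labels)
    return 1 - max(ones, len(labels) - ones) / len(labels)

def gini(labels):
    """Gini impurity for binary labels."""
    p1 = sum(labels) / len(labels)
    return 1 - p1**2 - (1 - p1)**2

def weighted(metric, left, right):
    """Weighted average of a node metric over the two children."""
    n = len(left) + len(right)
    return len(left) / n * metric(left) + len(right) / n * metric(right)

parent = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(class_error(parent), gini(parent))   # ~0.4, ~0.48

# Case I
left, right = [1, 1, 0, 1], [1, 0, 1, 0, 1, 0]
print(weighted(class_error, left, right))  # ~0.4  -- no decrease from the parent
print(weighted(gini, left, right))         # ~0.45 -- lower than the parent's 0.48
```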

Reference:
Sebastian Raschka's blog

Answered by 10xAI on February 28, 2021
