Data Science: Asked by Gregwar on February 28, 2021
In a decision tree, Gini impurity [1] is a metric that estimates how mixed the classes inside a node are. It is the probability that the tree mislabels a sample if it assigns a class drawn at random from the node's class distribution:
$$
I_g(p) = 1 - \sum_{i=1}^{J} p_i^2
$$
If we have 80% of class C1 and 20% of class C2, labelling randomly then yields a Gini impurity of $1 - 0.8 \times 0.8 - 0.2 \times 0.2 = 0.32$.
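As a quick sanity check of that number, here is a minimal Python snippet; it assumes nothing beyond the formula above, and the helper name is just for illustration:

# Gini impurity of a node from its class proportions: 1 - sum(p_i^2)
def gini_impurity(proportions):
    return 1.0 - sum(p * p for p in proportions)

print(gini_impurity([0.8, 0.2]))  # 0.32 for the 80% / 20% node above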
However, randomly assigning a class using the distribution seems like a bad strategy compared with simply assigning the most represented class in the node (in the above example, you would label C1 all the time and get only 20% error instead of 32%).
In that case, I would be tempted to simply use the following as a metric, since it is also the probability of mislabelling:
$$
I_m(p) = 1 - \max_i \, p_i
$$
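A minimal sketch of this alternative metric, for comparison with the Gini value computed above (again, the helper name is mine, purely for illustration):

# Impurity from always predicting the majority class: 1 - max(p_i)
def misclassification_impurity(proportions):
    return 1.0 - max(proportions)

print(misclassification_impurity([0.8, 0.2]))  # 0.2, versus 0.32 for Gini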
Is there a deeper reason to use Gini, and/or a good reason not to use this approach instead? (In other words, Gini seems to overestimate the mislabellings that will happen, doesn't it?)
[1] https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
EDIT: Motivation
Suppose you have two classes $C_1$ and $C_2$, with probabilities $p_1$ and $p_2$ ($1 \ge p_1 \ge 0.5 \ge p_2 \ge 0$, $p_1 + p_2 = 1$).
You want to compare the strategy "always label $C_1$" with the strategy "label $C_1$ with probability $p_1$ and $C_2$ with probability $p_2$"; the probabilities of success are respectively $p_1$ and $p_1^2 + p_2^2$.
We can rewrite the second one as:
$$
p_1^2 + p_2^2 = p_1^2 + 2p_1p_2 - 2p_1p_2 + p_2^2 = (p_1 + p_2)^2 - 2p_1p_2 = 1 - 2p_1p_2
$$
Thus, subtracting it from $p_1$ (and using $p_1 - 1 = -p_2$):
$$
p_1 - 1 + 2p_1p_2 = 2p_1p_2 - p_2 = p_2(2p_1 - 1)
$$
Since $p_1 \ge 0.5$, then $p_2(2p_1 - 1) \ge 0$, and thus:
$$
p_1 \ge p_1^2 + p_2^2
$$
So choosing the class with the highest probability is always at least as good a choice.
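A small numerical check of this inequality over the whole range $0.5 \le p_1 \le 1$; this is only a verification of the algebra above, not part of the original argument:

# Verify p1 >= p1^2 + p2^2 on a grid of p1 values, with p2 = 1 - p1
import numpy as np

p1 = np.linspace(0.5, 1.0, 51)
p2 = 1.0 - p1
assert np.all(p1 >= p1**2 + p2**2 - 1e-12)  # holds everywhere on the grid
print(np.max(p1 - (p1**2 + p2**2)))         # largest gap: 0.125, at p1 = 0.75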
EDIT: Choosing an attribute
Suppose now we have $n_1$ items in $C_1$ and $n_2$ items in $C_2$. We have to choose which attribute $a \in A$ is the best to split the node. If we use the superscript $n^v$ for the number of items that have value $v$ for a given attribute (and $n^v_1$ for the items of $C_1$ that have value $v$), I propose we use the score:
$$
\sum_v \frac{n^v}{n_1 + n_2} \cdot \frac{\max(n^v_1, n^v_2)}{n^v_1 + n^v_2}
$$
as a criterion instead of Gini.
Note that, since $n^v = n^v_1 + n^v_2$ and $n_1 + n_2$ does not depend on the chosen attribute, maximizing this is equivalent to maximizing:
$$
\sum_v \max(n^v_1, n^v_2)
$$
which can simply be interpreted as the number of items in the dataset that would be correctly classified.
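As an illustration, here is a small sketch of this criterion on made-up branch counts; the function name and the counts are hypothetical, only the formula comes from the text above:

# Proposed score: for each attribute value v, the majority class of that branch
# classifies max(n_v_1, n_v_2) items correctly; sum over the branches.
def majority_score(branches):
    # branches: list of (n_v_1, n_v_2) pairs, one pair per attribute value v
    return sum(max(n1, n2) for n1, n2 in branches)

print(majority_score([(3, 1), (3, 3)]))  # 6 items correctly classified
print(majority_score([(6, 0), (0, 4)]))  # 10, i.e. a pure split classifies everything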
Since I only had a gut feeling about this, I decided to implement it in code. For this, I followed the method described at https://victorzhou.com/blog/gini-impurity/. Generally, calculating the GI gives you a metric for which you don't have to know the underlying distribution, which I think is the reason why your example works.
Generally, my conclusion is: "Yes, there are situations in which a majority count is more useful, but usually the GI produces a better separation."
Here is what I've done, following the link mentioned above:
Depending on whether np.sqrt(data.p_x * data.p_y) was bigger or smaller than a cutoff, I assigned either blue or red as the colour. I simplified the real world by only using vertical splits, shown in light grey, so I'm basically approaching a 2D problem in 1D. The horizontal lines correspond to the accuracy of the majority-vote principle you proposed.
I uploaded the code here: https://github.com/dazim/stackoverflow_answers/blob/main/gini-impurity-in-decision-tree-reasons-to-use-it.ipynb
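For readers who don't want to open the notebook, here is a rough, self-contained sketch of the kind of comparison described above; the data generation, column names, and cutoffs are my assumptions, not the author's actual code:

# Label random 2D points by whether sqrt(x * y) exceeds a cutoff, then compare
# the weighted Gini impurity against the majority-vote accuracy for a handful
# of candidate vertical splits (hypothetical parameters throughout).
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.uniform(size=500), rng.uniform(size=500)
label = (np.sqrt(x * y) > 0.5).astype(int)  # the two colours encoded as 0 / 1

def gini(labels):
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 1.0 - p**2 - (1.0 - p)**2

def weighted_gini(labels, mask):
    n = len(labels)
    return mask.sum() / n * gini(labels[mask]) + (~mask).sum() / n * gini(labels[~mask])

def majority_accuracy(labels, mask):
    correct = max(labels[mask].sum(), (1 - labels[mask]).sum()) \
            + max(labels[~mask].sum(), (1 - labels[~mask]).sum())
    return correct / len(labels)

for cut in np.linspace(0.1, 0.9, 9):  # candidate vertical splits at x = cut
    mask = x < cut
    print(f"cut={cut:.1f}  gini={weighted_gini(label, mask):.3f}  "
          f"majority_acc={majority_accuracy(label, mask):.3f}")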
Edit 2021-02-19:
Based on the discussion in the comments, I replaced the GI with the approach that Gregwar proposed.
Based on the two approaches, I evaluated the performance of a node making the decision based on the best cut. Plotted is score_gregwar - score_GI.
Sadly, I've got to work on other stuff, but maybe someone else wants to pick up on that. Based on what I see in this simulation, the differences don't seem that big. However, using more "real" data and not just drawing vertical splits might change this. Sorry!
Answered by ttreis on February 28, 2021
Misclassification error will not help in splitting the tree.
Reason: we consider the weighted drop in error from the parent node to the child nodes, and with misclassification error this drop often comes out as 0 (other than for pure splits), as in the example below.
Let's consider an example:
Data = 1, 1, 0, 1, 0, 1, 0, 1, 0, 1
Parent Classification error= 4/10 = 0.4
Parent Gini Impurity = 1-(0.4x0.4+0.6x0.6) = 0.48
Case - I
Split - 1, 1, 0, 1 Vs 1, 0, 1, 0, 1, 0
Classification error= 0.4x0.25 + 0.6x0.5 = 0.4, split not possible
Gini Impurity = 0.45
Case - II
Split - 1, 1, 1, 0, 0 Vs 0, 1, 0, 1, 0
Classification error = 0.5x0.4 + 0.5x0.4 = 0.4, split not possible
Gini Impurity = 0.48
Case - III
Split - 1, 1, 0, 0, 1, 1 Vs 1, 0, 1, 0
Classification error = 0.6x(2/6) + 0.4x0.5 = 0.4, split not possible
Gini Impurity = 0.467
Pure splits
Split - 1, 1, 1, 1, 1, 1 Vs 0, 0, 0, 0
Classification error = 0, split possible but no further splits
Gini Impurity = 0
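The numbers above can be reproduced in a few lines of Python; this is just a re-computation of the listed splits, not code from the referenced blog:

# Weighted misclassification error and Gini impurity for the splits above
# (parent data: six 1s and four 0s)
def misclass_error(labels):
    p1 = sum(labels) / len(labels)
    return min(p1, 1 - p1)

def gini(labels):
    p1 = sum(labels) / len(labels)
    return 1 - p1**2 - (1 - p1)**2

def weighted(metric, left, right):
    n = len(left) + len(right)
    return len(left) / n * metric(left) + len(right) / n * metric(right)

splits = {
    "Case I":   ([1, 1, 0, 1],       [1, 0, 1, 0, 1, 0]),
    "Case II":  ([1, 1, 1, 0, 0],    [0, 1, 0, 1, 0]),
    "Case III": ([1, 1, 0, 0, 1, 1], [1, 0, 1, 0]),
    "Pure":     ([1, 1, 1, 1, 1, 1], [0, 0, 0, 0]),
}
for name, (left, right) in splits.items():
    print(name,
          "error:", round(weighted(misclass_error, left, right), 3),
          "gini:",  round(weighted(gini, left, right), 3))

Every non-pure split leaves the weighted error at the parent's 0.4, while the weighted Gini drops below the parent's 0.48 for Cases I and III, which is why Gini can still rank these splits.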
Reference: Sebastian Raschka's blog
Answered by 10xAI on February 28, 2021