Data Science Asked by Krushe on June 18, 2021
I’m trying to build a decision tree algorithm, but I think I misinterpreted how information gain works.
Let’s say we have a balanced binary classification problem.
So, the initial entropy should equal 1.
Let’s define information gain as follows:
info_gain = initial_entropy - weighted_average(entropy(left_node), entropy(right_node))
We gain information if we decrease the initial entropy, that is, if info_gain > 0. If info_gain == 0, that means
weighted_average(entropy(left_node), entropy(right_node)) == initial_entropy.
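For concreteness, here is a minimal sketch of that definition in Python (the function names and structure are mine, purely illustrative):

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, left, right):
    # initial entropy minus the size-weighted average of the child entropies
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted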
Let’s say we have 4 features such that
weighted_average(entropy(left_node), entropy(right_node))
is ordered as
wa_of_feature_1 > wa_of_feature_2 > wa_of_feature_3 > wa_of_feature_4.
The more info_gain exceeds 0, the more order the feature brings to the system.
So, based on our 4 features, the maximum information gain will be
info_gain_max = initial_entropy - wa_of_feature_4
since that gives a larger number than using wa_of_feature_n for any 1 <= n < 4.
Is this the correct interpretation?
Let's start by defining what you are trying to achieve with a decision tree.
We want a tree that correctly classifies the data. For that, out of all the available features, I want to select those that give me the best information about my classes, i.e., how well does a feature split my data? That is what helps me classify better.
Now, entropy is a technique that helps me understand how "pure" a subset is, i.e., if I select feature_1 to split on, how well does it split my data so that most of the labels in each child come from the same class?
So entropy is an impurity measure. If all instances are from the same class, there is 0 impurity, hence E = 0; if there are equal numbers of instances from both classes, impurity is at its highest, E = 1.
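A quick numeric check of those two extremes (a small illustrative snippet, not part of the original answer):

import numpy as np

def binary_entropy(p):
    # entropy of a two-class node whose positive-class proportion is p
    if p in (0.0, 1.0):
        return 0.0  # pure node: zero impurity
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(binary_entropy(1.0))  # all instances from one class -> E = 0.0
print(binary_entropy(0.5))  # equal counts from both classes -> E = 1.0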
Now, we need to select which attribute is best to split on first, and then, recursively, which ones to use later on.
Here comes information gain.
Information gain tells me how much more useful/good an attribute is than the rest. For that, we compare the entropy of the parent node before splitting with the weighted impurity of the child nodes after splitting. The larger the difference, the better the attribute test condition.
Higher gain = purer child nodes
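A small numeric illustration of that comparison (toy numbers of my own choosing):

import numpy as np

def H(p):
    # binary entropy of a node with class proportions (p, 1 - p)
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

parent = H(0.5)                       # balanced parent: 1.0 bit
good = 0.5 * H(0.9) + 0.5 * H(0.1)    # nearly pure children: ~0.47
poor = 0.5 * H(0.55) + 0.5 * H(0.45)  # children that look like the parent: ~0.99

print(parent - good)  # ~0.53: large gain, good attribute
print(parent - poor)  # ~0.007: tiny gain, poor attribute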
"So, the initial entropy should equal 1."
1 is the highest entropy; it means that if I have 4 instances, 2 of them +ve and 2 of them -ve, the set is highly impure. It could just as well be 0 if the classes are 4 +ve and 0 -ve, and so on, as per the calculation.
"We gain information if we decrease the initial entropy"
This initial entropy is the entropy of the data set before splitting, i.e., the entropy of the parent node. It depends on the data.
"The more info_gain exceeds 0, the more order the feature brings to the system."
No, it is not compared with 0; rather, the gains of all attributes are compared with each other. E.g., if
gain(W_1) = 0.4, gain(W_2) = 1.2,
then the highest gain is W_2's, hence the DT will use W_2 for the split.
Imagine you have a dataset with 3 attributes A1, A2, A3. Which attribute should you test first?
Compute the entropy of the complete set, E(complete), and in the same way the weighted child entropy after splitting on each attribute: E(A1), E(A2), E(A3).
Now:
gain(A1) = E(complete) - E(A1) = 0.2
gain(A2) = E(complete) - E(A2) = 0.5
gain(A3) = E(complete) - E(A3) = 1
The highest gain is 1, hence we shall split on A3.
If several attributes have the same gain, then we take their order into consideration.
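A sketch of that selection step (the gains are hard-coded to match the hypothetical numbers above):

# gain(Ai) = E(complete) - weighted child entropy after splitting on Ai
gains = {"A1": 0.2, "A2": 0.5, "A3": 1.0}
best = max(gains, key=gains.get)
print(best)  # -> A3, the attribute to split on first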
Answered by BlackCurrant on June 18, 2021
We compare the child entropy with the parent's, so we must weight each child by its split size, not 50-50.
Intuition:
(a very big "ugly" impure child and a small "great" pure child)
Averaging these two without considering their respective sizes will give you a decent-looking entropy, but it is actually not a great split.
Code e.g.

import numpy as np

a = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
# p(0) = 0.4, p(1) = 0.6
parent = -(0.4 * np.log2(0.4) + 0.6 * np.log2(0.6))  # ~0.97

# Let's split and compare the unweighted vs. weighted dip
a1 = [0]
a2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # p(0) = 3/9, p(1) = 6/9

e1 = 0.0  # a1 is pure
e2 = -((3/9) * np.log2(3/9) + (6/9) * np.log2(6/9))  # ~0.92

# Without weights
unweighted = (e1 + e2) / 2  # ~0.46
Looks like a great dip (0.97 -> 0.46) in entropy, but a2 hardly looks any different from the parent.

# With weights
weighted = (1/10) * e1 + (9/10) * e2  # ~0.83
The dip is now much smaller (0.97 -> 0.83), which matches the intuition: this split barely changed anything, so it is not a great split.
Answered by 10xAI on June 18, 2021