Data Science Asked by Krushe on June 18, 2021
I’m trying to build a decision tree algorithm, but I think I misinterpreted how information gain works.
Let’s say we have a balanced binary classification problem.
So, the initial entropy should equal 1.
Let’s define information gain as follows:
info_gain = initial_entropy - weighted_average(entropy(left_node), entropy(right_node))
We gain information if we decrease the initial entropy, that is, if info_gain > 0. If info_gain == 0, that means
weighted_average(entropy(left_node), entropy(right_node)) == initial_entropy.
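For concreteness, here is a minimal sketch of that definition in Python (the function names and structure are mine, purely illustrative):

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, left, right):
    # initial entropy minus the size-weighted average of the child entropies
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted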
Let’s say we have 4 features such that
weighted_average(entropy(left_node), entropy(right_node))
is ordered as
wa_of_feature_1 > wa_of_feature_2 > wa_of_feature_3 > wa_of_feature_4.
The more info_gain exceeds 0, the more order the feature brings to the system.
So, based on our 4 features, the maximum information gain will be
info_gain_max = initial_entropy - wa_of_feature_4
since that gives a larger number than using wa_of_feature_n for any 1 <= n < 4.
Is this the correct interpretation?
Let's start by defining what you are trying to achieve with a decision tree.
We want a tree that correctly classifies the data. For that, out of all the available features, I want to select those that give me the best information about my classes, i.e., how well does a feature split my data? That is what helps me classify better.
Now, entropy is a technique that helps me understand how "pure" a subset is, i.e., if I select feature_1 to split on, how well does it split my data so that most of the labels in each child come from the same class?
So entropy is an impurity measure. If all instances are from the same class, there is 0 impurity, hence E = 0; if there are equal numbers of instances from both classes, impurity is at its highest, E = 1.
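A quick numeric check of those two extremes (a small illustrative snippet, not part of the original answer):

import numpy as np

def binary_entropy(p):
    # entropy of a two-class node whose positive-class proportion is p
    if p in (0.0, 1.0):
        return 0.0  # pure node: zero impurity
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(binary_entropy(1.0))  # all instances from one class -> E = 0.0
print(binary_entropy(0.5))  # equal counts from both classes -> E = 1.0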
Now, we need to select which attribute is best to split on first, and then, recursively, which ones to use later on.
Here comes information gain.
Information gain tells me how much more useful/good an attribute is than the rest. For that, we compare the entropy of the parent node before splitting with the weighted impurity of the child nodes after splitting. The larger the difference, the better the attribute test condition.
Higher gain = purer child nodes
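A small numeric illustration of that comparison (toy numbers of my own choosing):

import numpy as np

def H(p):
    # binary entropy of a node with class proportions (p, 1 - p)
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

parent = H(0.5)                       # balanced parent: 1.0 bit
good = 0.5 * H(0.9) + 0.5 * H(0.1)    # nearly pure children: ~0.47
poor = 0.5 * H(0.55) + 0.5 * H(0.45)  # children that look like the parent: ~0.99

print(parent - good)  # ~0.53: large gain, good attribute
print(parent - poor)  # ~0.007: tiny gain, poor attribute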
"So, the initial entropy should equal 1."
1 is the highest entropy; it means that if I have 4 instances, 2 of them +ve and 2 of them -ve, the set is highly impure. It could just as well be 0 if the classes are 4 +ve and 0 -ve, and so on, as per the calculation.
"We gain information if we decrease the initial entropy"
This initial entropy is the entropy of the data set before splitting, i.e., the entropy of the parent node. It depends on the data.
"The more info_gain exceeds 0, the more order the feature brings to the system."
No, it is not compared with 0; rather, the gains of all attributes are compared with each other. E.g., if
gain(W_1) = 0.4, gain(W_2) = 1.2,
then the highest gain is W_2's, hence the DT will use W_2 for the split.
Imagine you have a dataset with 3 attributes A1, A2, A3. Which attribute should you test first?
Compute the entropy of the complete set, E(complete), and in the same way the weighted child entropy after splitting on each attribute: E(A1), E(A2), E(A3).
Now:
gain(A1) = E(complete) - E(A1) = 0.2
gain(A2) = E(complete) - E(A2) = 0.5
gain(A3) = E(complete) - E(A3) = 1
The highest gain is 1, hence we shall split on A3.
If several attributes have the same gain, then we take their order into consideration.
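A sketch of that selection step (the gains are hard-coded to match the hypothetical numbers above):

# gain(Ai) = E(complete) - weighted child entropy after splitting on Ai
gains = {"A1": 0.2, "A2": 0.5, "A3": 1.0}
best = max(gains, key=gains.get)
print(best)  # -> A3, the attribute to split on first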
Answered by BlackCurrant on June 18, 2021
We compare the child entropy with the parent's, so we must weight each child by its split size, not 50-50.
Intuition:
(a very big "ugly" impure child and a small "great" pure child)
Averaging these two without considering their respective sizes will give you a decent-looking entropy, but it is actually not a great split.
Code e.g.

import numpy as np

a = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
# p(0) = 0.4, p(1) = 0.6
parent = -(0.4 * np.log2(0.4) + 0.6 * np.log2(0.6))  # ~0.97

# Let's split and compare the unweighted vs. weighted dip
a1 = [0]
a2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # p(0) = 3/9, p(1) = 6/9

e1 = 0.0  # a1 is pure
e2 = -((3/9) * np.log2(3/9) + (6/9) * np.log2(6/9))  # ~0.92

# Without weights
unweighted = (e1 + e2) / 2  # ~0.46
Looks like a great dip (0.97 -> 0.46) in entropy, but a2 hardly looks any different from the parent.

# With weights
weighted = (1/10) * e1 + (9/10) * e2  # ~0.83
The dip is now much smaller (0.97 -> 0.83), which matches the intuition: this split barely changed anything, so it is not a great split.
Answered by 10xAI on June 18, 2021