The notation of $splits(label)$ under Random Forest

Question

On the "Fair Forests: Regularized Tree Induction to Minimize Model Bias", it is written that

We propose a simple regularization approach to constructing a fair decision tree induction algorithm. This is done by altering the way we measure the information gain $G(T, a)$, where $T$ is a set of training examples, and $a$ is the attribute to split on. We will denote the set of points in each of the $k$ branchs of the tree as $T_{i ldots k}$. This normally is combined with an impurity measure $I(T)$, to give us
$$
G(T, a)=I(T)-sum_{forall T_{i} in text { splits }(a)} frac{left|T_{i}right|}{|T|} cdot Ileft(T_{i}right)
$$
The information gain scores the quality of a splitting attribute $a$ by how much it reduces impurity compared to the current impurity. The larger the gain, the more pure the class labels have become, and thus, should improve classification performance. In the CART algorithm, the Gini impurity is normally used for categorical targets.
$$
I_{mathrm{Gini}}(T)=1-sum_{forall T_{i} in text { splits }(text { label })}left(frac{left|T_{i}right|}{|T|}right)^{2}
$$

Especially, those two things make me confused very much.

The definiton of $mathrm{splits(a)}$ is not written in the paper.
It is written that $T_i$ is the branch, however, for me, it seems a node.

I have already learned about Random Forest on the ESL, so I know the normal definition of impurity as:
$$
I_G(p) = 1 - sum_{i=1}^{k} p_i^2
$$
However, I can't figure out each definition in this paper...

Ashish Jain · Answer

You correctly mentioned the definition of impurity which is
$$I_G(P) = 1 - sum _{i=1}^k p_i^2$$
This can be written as
$$I_G(P) = sum _{i=1}^k p_i*(1 - p_i)$$
At any split, for any branch $T_i$, you calculate the probability using the classical definition i.e.,
$$p_i = frac{|T_i|}{|T|}$$
Using this, you can derive the impurity definition $I(T_i)$ given in the paper.
Once you are able to calculate the impurity of a branch you can calculate the impurity of a node using the weighted sum of each branch, where weights are the probability of that branch. Hence,
$$I_{node} = sum_{forall T_i  in splits(a)} frac{|T_i|}{|T|} I(T_i)$$
Using the above impurity, you can calculate the change in impurity or the information gain as given in the paper.
I hope this solves your doubt.

The notation of $splits(label)$ under Random Forest

One Answer

Add your own answers!

Ask a Question