Data Science Asked by Ric S on February 19, 2021
Suppose I have a set of M categorical variables, some of them with a different number of categories (for instance, var1 has five categories, var2 has three, etc).
I train an XGBoost model on a numeric target Y after having performed one-hot encoding on the M categorical variables, thus creating a set of dummy inputs.
When looking at the model results, I get a table of importance gain for the categories of each feature, meaning how important they are in the model. A toy result would look like this:
feature | category | gain
--------|----------|-----
var1    | cat3     | 25
var2    | cat1     | 20
var1    | cat5     | 12
var5    | cat6     | 11
var4    | cat1     | 8
...     | ...      | ...
The main question I’m asking is the following:
Summing such gains would probably be misleading, since the features have different numbers of categories, but I'm wondering whether the average of these gains could serve as an indicator of the overall importance of a particular feature.
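For what it's worth, if the gain table is held in a pandas DataFrame, the average I'm asking about is a simple groupby. This sketch just reproduces the toy table above (the column and variable names are my own choice):

```python
import pandas as pd

# The toy importance table from above, one row per (feature, category) pair
imp = pd.DataFrame({
    "feature":  ["var1", "var2", "var1", "var5", "var4"],
    "category": ["cat3", "cat1", "cat5", "cat6", "cat1"],
    "gain":     [25, 20, 12, 11, 8],
})

# Average gain per original variable (and total, for comparison)
summary = imp.groupby("feature")["gain"].agg(["mean", "sum"])
print(summary)  # e.g. var1 averages (25 + 12) / 2 = 18.5
```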
I already looked at some questions like this one without gaining much insight into the topic.
I think you are looking for information gain.
The way you would compute it for 1 variable is:
Let's say that your label variable is binary.
1) Compute the percentage of positive labels per category. For example, suppose you have three categories: "US", "UK", "Ger". If there are 5 instances labeled 1, of which 3 are associated with "UK", 2 with "US", and 0 with "Ger", your percentages would be [3/5, 2/5, 0/5].
2) Do step 1) for every label value and aggregate.
3) Calculate the entropy of each set of percentages.
4) Information Gain = entropy of the aggregated percentages minus the sum of the weighted entropies, where each weight is the number of instances with a given label divided by the total number. For example, if the labels were [1,1,0,0,1,1,0,0,1], the entropy of the percentages for label 1 would be weighted by 5/9 and the entropy for label 0 by 4/9.
Then you compute IG for every variable, and compare!
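The steps above amount to the textbook formula IG(Y; X) = H(Y) - sum over categories c of p(c) * H(Y | c). A minimal self-contained sketch, where the country/label data is made up to match the numbers in steps 1) and 4):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(categories, labels):
    """IG = H(labels) - weighted sum of per-category conditional entropies."""
    n = len(labels)
    conditional = 0.0
    for cat in set(categories):
        subset = [y for x, y in zip(categories, labels) if x == cat]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy data: the labels from step 4), paired with a made-up country column
# so that the 5 positive labels split as 3 "UK", 2 "US", 0 "Ger" (step 1)
countries = ["UK", "UK", "Ger", "Ger", "US", "US", "UK", "US", "UK"]
labels = [1, 1, 0, 0, 1, 1, 0, 0, 1]
print(information_gain(countries, labels))  # roughly 0.32
```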
As a side note, if you are working with a lot of categorical variables, you might want to look into LightGBM or CatBoost; these algorithms let you declare categorical variables directly, without creating one-hot encoded vectors, and they provide feature importance on those variables.
Answered by Akavall on February 19, 2021