Data Science Asked by Ric S on February 19, 2021
Suppose I have a set of M categorical variables, some of them with a different number of categories (for instance, var1 has five categories, var2 has three, etc).
I train an XGBoost model on a numeric target Y after having performed one-hot encoding on the M categorical variables, thus creating a set of dummy inputs.
When looking at the model results, I get a table of importance gain for the categories of each feature, meaning how important they are in the model. A toy result would look like this:
feature | category | gain
--------|----------|-----
var1    | cat3     | 25
var2    | cat1     | 20
var1    | cat5     | 12
var5    | cat6     | 11
var4    | cat1     | 8
...     | ...      | ...
The main question I’m asking is the following:
Summing such gains would probably be misleading, since the features have different numbers of categories, but I'm wondering whether the average of these gains could serve as an indicator of the overall importance of a particular feature.
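For what it's worth, if the gain table is held in a pandas DataFrame, the average I'm asking about is a simple groupby. This sketch just reproduces the toy table above (the column and variable names are my own choice):

```python
import pandas as pd

# The toy importance table from above, one row per (feature, category) pair
imp = pd.DataFrame({
    "feature":  ["var1", "var2", "var1", "var5", "var4"],
    "category": ["cat3", "cat1", "cat5", "cat6", "cat1"],
    "gain":     [25, 20, 12, 11, 8],
})

# Average gain per original variable (and total, for comparison)
summary = imp.groupby("feature")["gain"].agg(["mean", "sum"])
print(summary)  # e.g. var1 averages (25 + 12) / 2 = 18.5
```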
I already looked at some questions like this one without gaining much insight into the topic.
I think you are looking for information gain.
The way you would compute it for 1 variable is:
Let's say that your label variable is binary.
1) Compute the percentage of positive labels per category. For example, suppose you have three categories: "US", "UK", "Ger". If there are 5 instances labeled 1, of which 3 are associated with "UK", 2 with "US", and 0 with "Ger", your percentages would be [3/5, 2/5, 0/5].
2) Do step 1) for every label value and aggregate.
3) Calculate the entropy of each set of percentages.
4) Information Gain = entropy of the aggregated percentages minus the sum of the weighted entropies, where each weight is the number of instances with a given label divided by the total number. For example, if the labels were [1,1,0,0,1,1,0,0,1], the entropy of the percentages for label 1 would be weighted by 5/9 and the entropy for label 0 by 4/9.
Then you compute IG for every variable, and compare!
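The steps above amount to the textbook formula IG(Y; X) = H(Y) - sum over categories c of p(c) * H(Y | c). A minimal self-contained sketch, where the country/label data is made up to match the numbers in steps 1) and 4):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(categories, labels):
    """IG = H(labels) - weighted sum of per-category conditional entropies."""
    n = len(labels)
    conditional = 0.0
    for cat in set(categories):
        subset = [y for x, y in zip(categories, labels) if x == cat]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy data: the labels from step 4), paired with a made-up country column
# so that the 5 positive labels split as 3 "UK", 2 "US", 0 "Ger" (step 1)
countries = ["UK", "UK", "Ger", "Ger", "US", "US", "UK", "US", "UK"]
labels = [1, 1, 0, 0, 1, 1, 0, 0, 1]
print(information_gain(countries, labels))  # roughly 0.32
```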
As a side note, if you are working with a lot of categorical variables, you might want to look into LightGBM or CatBoost; these algorithms let you declare categorical variables directly, without creating one-hot encoded vectors, and they provide feature importance on those variables.
Answered by Akavall on February 19, 2021