Cross Validated Asked on November 2, 2021
This is a cross-post of a question I posted several months ago on a different forum.
I am trying to understand how variable importance is calculated, based on the research papers published on the topic. There are two papers to draw on. The first is the original paper (and presentation), which is very math-heavy. The second is a less mathematical, more intuitive treatment that is much easier to follow.
The key paragraph from the more intuitive explanation is below:
The set of variables $Z$ to be conditioned on should contain all variables that are correlated with the current variable of interest $X_j$. In the varimp function, this is assured by the small default value 0.2 of the threshold argument: By default, all variables whose correlation with $X_j$ meets the condition 1 - ($p$-value) > 0.2 are used for conditioning. A larger value of threshold would have the effect that only those variables that are strongly correlated with $X_j$ would be used for conditioning, but would also lower the computational burden.
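The threshold rule in the quoted paragraph can be sketched in base R. This is only an illustration, not party's internal code: it assumes Pearson correlation tests via cor.test as a stand-in for whatever association test varimp uses internally, and conditioning_candidates is a hypothetical helper name.

```r
# Sketch: which predictors would be candidates for the conditioning set Z of X_j?
# Assumption: stats::cor.test (Pearson) stands in for party's internal tests.
conditioning_candidates <- function(data, xj, threshold = 0.2) {
  others <- setdiff(names(data), xj)
  # 1 - (p-value) of the association between X_j and each other predictor
  crit <- vapply(others, function(v) {
    1 - cor.test(data[[xj]], data[[v]])$p.value
  }, numeric(1))
  # Keep the predictors that meet the condition 1 - (p-value) > threshold
  others[crit > threshold]
}

# The four numeric predictors from iris, as in the question's data
predictors <- iris[, 1:4]
conditioning_candidates(predictors, "Petal.Length")
conditioning_candidates(predictors, "Sepal.Width", threshold = 0.95)
```

Raising threshold can only shrink the returned set, which matches the quoted remark that a larger threshold keeps only the strongly associated variables.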
The best intuition I have seen for conditional inference trees is this blog post; when I stepped through it, it made instant sense how the trees work. They stopped short of variable importance, though 🙂
My understanding and where it all breaks down:
I think this would make more sense to me if I could see what Z looks like for a single tree, so I ran the code below. Can anyone help?
library(party)
library(janitor)
library(tidyverse)
set.seed(123)
# Create a dataframe where we are trying to predict setosa
mydf <- iris %>%
  mutate(set_tgt = factor(ifelse(Species == 'setosa', 'yes', 'no'))) %>%
  select(-Species)
# We will try to predict "set_tgt"
cf_mod <- cforest(set_tgt ~ ., data = mydf, control = cforest_unbiased(mtry = 2, ntree = 3))
# With conditional = TRUE, each variable is permuted conditionally on the
# other variables selected by the threshold
varimp(cf_mod, conditional = TRUE, threshold = 0.2) %>%
  enframe() %>%
  arrange(desc(value))
# Finding Z
mod <- ctree(set_tgt ~ ., data = mydf)
plot(mod)
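# Row names are the label we are trying to predict.
# Row 1 (the 50 flowers with Petal.Length <= 1.9) is setosa, i.e. "yes";
# row 2 (the 100 flowers with Petal.Length > 1.9) is "no".
Z <- tibble("Petal.Length <= 1.9" = 50, "Petal.Length > 1.9" = 0) %>%
  bind_rows(tibble("Petal.Length <= 1.9" = 0, "Petal.Length > 1.9" = 100)) %>%
  data.frame() %>%
  clean_names()
row.names(Z) <- c("yes", "no")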
# This creates the Z dataframe (maybe), if my understanding isn’t completely wrong
Z
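To make the idea of permuting conditionally on Z concrete, here is a minimal base-R sketch of shuffling one variable within groups defined by another. It assumes the conditional scheme shuffles $X_j$ only among observations that fall in the same cell of a grid built from cutpoints of the variables in $Z$. The single cutpoint 1.9 on Petal.Length is taken from the split used in the code above, Petal.Width plays the role of $X_j$ purely for illustration, and permute_within is a hypothetical helper, not a party function.

```r
set.seed(123)

# Hypothetical helper: shuffle x separately within each level of `cell`,
# leaving every value inside the cell it started in.
permute_within <- function(x, cell) {
  ave(x, cell, FUN = function(v) v[sample.int(length(v))])
}

# Grid from a single cutpoint on the conditioning variable:
# (-Inf, 1.9] and (1.9, Inf]
cell <- cut(iris$Petal.Length, breaks = c(-Inf, 1.9, Inf))

# Petal.Width is shuffled only among flowers in the same grid cell
permuted <- permute_within(iris$Petal.Width, cell)

# Number of observations per cell, and a check that each cell still holds
# the same set of Petal.Width values after the shuffle
table(cell)
tapply(iris$Petal.Width, cell, sum)
tapply(permuted, cell, sum)
```

Under this reading, Z would act as a grouping of the rows rather than as a table of counts, although whether that matches what varimp does internally is exactly what the question asks.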