How to extract the sample split (values) of decision tree leaves (terminal nodes) using the h2o library

Data Science: asked on February 2, 2021

Sorry for a long story, but it is a long story. 🙂
I am using the h2o library for Python to build a decision tree and extract decision rules from it.
I use some training data where the labels take TRUE and FALSE values.
My final goal is to extract the significant paths (leaves) of the tree where the number of TRUE cases significantly exceeds the number of FALSE ones.

from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.tree import H2OTree

treemodel = H2OGradientBoostingEstimator(ntrees=3, max_depth=maxDepth, distribution="bernoulli")
treemodel.train(x=somedata.names[1:], y=somelabel.names[0], training_frame=somedata)
dtree = H2OTree(model=treemodel, tree_number=0, tree_class=False)

import h2o
from h2o.utils.typechecks import assert_is_type, Enum

def predict_leaf_node_assignment(self, test_data, type="Path"):
    if not isinstance(test_data, h2o.H2OFrame):
        raise ValueError("test_data must be an instance of H2OFrame")
    assert_is_type(type, None, Enum("Path", "Node_ID"))
    j = h2o.api("POST /3/Predictions/models/%s/frames/%s" % (self.model_id, test_data.frame_id),
                data={"leaf_node_assignment": True, "leaf_node_assignment_type": type})
    return h2o.get_frame(j["predictions_frame"]["name"])

dfResLabH2O.leafs = predict_leaf_node_assignment(dtree, test_data=dfResLabH2O, type="Path")

In sklearn there is an option to explore the leaves via tree_.value.
As far as I understand, there is no such option in h2o.
Instead, h2o offers an option to return the predictions at the leaves.
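For reference, a minimal sketch (with made-up toy data) of the sklearn behaviour I mean; note that, depending on the sklearn version, tree_.value holds per-class counts or per-class fractions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up purely for illustration.
X = np.random.rand(200, 3)
y = np.random.rand(200) > 0.7                 # boolean labels, like TRUE/FALSE
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

tree = clf.tree_
for node_id in np.where(tree.children_left == -1)[0]:   # -1 marks a leaf
    # tree.value[node_id][0] is [n_FALSE, n_TRUE] (or class fractions in
    # newer sklearn versions; scale by n_node_samples to recover counts).
    print(node_id, tree.value[node_id][0], tree.n_node_samples[node_id])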

When I run dtree.predictions, I get pretty weird results:

dtree.predictions
Out[32]: [0.0, -0.020934915, 0.0832189, -0.0151052615, -0.13453846, -0.0039859135, 0.2931017, 0.0836743, -0.008562919, -0.12405087, -0.02181114, 0.06444048, -0.01736593, 0.13912177, 0.10727943]

My questions (somebody has already asked this, but no clear answer has been provided so far):

  1. What is the meaning of negative predictions? I expect to get a proportion p of TRUE to ALL or FALSE to ALL, where 0 <= p <= 1. Is anything wrong with my model?
    I ran it in scikit-learn, and there I can point out the significant paths and extract rules.

  2. For positive values: is it the TRUE-to-ALL or the FALSE-to-ALL proportion? I am guessing it is FALSE, since I specified tree_class = False, but I am not sure.

  3. Is there any method or solution for h2o trees that reveals the sample size of a certain leaf and the [n1, n2] counts for the TRUE and FALSE cases respectively, similar to what sklearn provides?

  4. I found in some forums a function def predict_leaf_node_assignment that aims to predict on a dataset and return the leaf node assignment (only for tree-based models), but it returns no output for me, and I cannot find any example of how to use it.

  5. The bottom line: I'd like to be able to extract the sample-size values of each leaf and the specific path to it, using [n1, n2] counts or valid proportions.

I’ll appreciate any kind of help and suggestions.
Thank you.

One Answer

So far I'm not seeing a way to extract training information from the model. The H2OTree.predictions can/should give you proportion information, but won't give you leaf sample sizes. For that, you should be able to use predict_leaf_node_assignment, passing your training set in (to wastefully get passed through the model, *shrug*).

predict_leaf_node_assignment should return a dataframe with the leaf assignment for each of your training points. (The R version appears to support returning either the path or the node id, but the Python one doesn't seem to have it.) You could take this, join it to the original frame, and use group and aggregation functions to produce the desired [n1,n2].*

Regarding the output of predictions, see https://stackoverflow.com/questions/44735518/how-to-reproduce-the-h2o-gbm-class-probability-calculation . In particular, the default learning rate in H2O's GBM is 0.1, which helps explain your muted results.
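For intuition, here is a minimal sketch of what those leaf values mean (the f0 and leaf values below are made-up numbers, not read from your model): in a bernoulli GBM each leaf stores an additive log-odds contribution, already muted by the learning rate, so negative values are normal, and only the sigmoid of the summed contributions is a probability.

import math

# Made-up numbers: f0 is a hypothetical initial log-odds prediction; the
# leaf values are log-odds increments (hence negatives are fine), one from
# each tree the row lands in.
f0 = -0.5
leaf_contribs = [-0.020934915, 0.0832189, -0.0151052615]
log_odds = f0 + sum(leaf_contribs)
p_true = 1.0 / (1.0 + math.exp(-log_odds))  # sigmoid maps back to [0, 1]
print(p_true)                               # ~0.39 with these made-up numbers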

Finally, for a little more fun with the model's tree objects, see https://www.pavel.cool/machine%20learning/h2o-3/h2o-3-tree-api/ and https://novyden.blogspot.com/2018/12/finally-you-can-plot-h2o-decision-trees.html

*EDIT: For doing the grouping and aggregation:
(I'm more used to pandas than H2O frames, so I'll convert first. And given that H2O thinks your FALSE class is the main class, maybe those labels are strings, not booleans?)

# Leaf assignment (one column per tree, e.g. 'T1.C1') for every training row.
predicted_leaves_frame = treemodel.predict_leaf_node_assignment(data).as_data_frame()
df = data.as_data_frame()
# Encode the TRUE/FALSE string labels as 1/0.
df['binary_dep_var'] = df['dep_var'].apply(lambda x: 1 if x == 'TRUE' else 0)
# Attach the first tree's leaf path to each row.
df['T1'] = predicted_leaves_frame['T1.C1']
print(df.groupby('T1')['binary_dep_var'].agg(['sum', 'count', 'mean']))

This should give, for each leaf, the number of TRUE samples, the total number of samples, and their ratio. If you also need the number of FALSE samples, you can define your own aggregation function or simply post-process this new dataframe.
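For instance, a minimal sketch of that post-processing, reusing the df from above, to get explicit [n1, n2] pairs per leaf:

agg = df.groupby('T1')['binary_dep_var'].agg(['sum', 'count', 'mean'])
agg['n_true'] = agg['sum'].astype(int)          # n1: TRUE count per leaf
agg['n_false'] = agg['count'] - agg['n_true']   # n2: FALSE count per leaf
print(agg[['n_true', 'n_false', 'mean']])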

Answered by Ben Reiniger on February 2, 2021
