Data Science Asked by CyberPlayerOne on July 29, 2021
I’ve trained a XGBoost model for regression, where the max depth is 2.
# Create the ensemble
ensemble_size = 200
ensemble = xgb.XGBRegressor(n_estimators=ensemble_size, n_jobs=4, max_depth=2, learning_rate=0.1,
objective='reg:squarederror')
ensemble.fit(train_x, train_y)
I’ve plotted the first tree in the ensemble:
# Plot single tree
plot_tree(ensemble, rankdir='LR')
Now I retrieve the leaf indices of the first training sample in the XGBoost ensemble model:
ensemble.apply(train_x[:1]) # leaf indices in all 200 base learner trees
array([[6, 6, 4, 6, 4, 6, 5, 5, 4, 5, 4, 3, 5, 4, 5, 3, 6, 3, 5, 5, 3,
3,
3, 5, 4, 4, 3, 4, 3, 6, 6, 6, 4, 6, 6, 3, 5, 3, 5, 4, 6, 4, 4, 6,
3, 3, 6, 3, 6, 3, 4, 3, 6, 6, 3, 6, 5, 3, 6, 6, 3, 4, 6, 5, 3, 3,
3, 6, 3, 4, 3, 6, 3, 6, 3, 3, 3, 4, 6, 3, 4, 4, 6, 3, 3, 6, 3, 6,
6, 3, 3, 4, 4, 4, 3, 3, 6, 6, 3, 3, 6, 3, 3, 3, 6, 6, 6, 4, 4, 3,
5, 3, 3, 3, 4, 5, 3, 3, 6, 3, 3, 6, 3, 4, 5, 3, 6, 3, 5, 3, 4, 4,
3, 3, 4, 6, 6, 6, 6, 3, 4, 4, 3, 5, 6, 6, 3, 5, 3, 3, 6, 6, 3, 3,
6, 3, 3, 4, 4, 3, 4, 3, 5, 3, 3, 3, 3, 3, 4, 4, 6, 3, 6, 4, 4, 5,
6, 3, 4, 5, 6, 3, 4, 3, 4, 5, 6, 6, 5, 4, 3, 3, 6, 6, 3, 6, 5, 4,
3, 3]], dtype=int32)
Here is my question:
Since there are four leaf nodes in the first tree, how come there is
index 6 for the first training sample?
In the official doc for apply(), it says "Leaves are numbered within [0; 2**(self.max_depth+1)), possibly with gaps in the numbering." So if max_depth is 2, the leaves are numbered between 0 and 7. Since there are only four leaves in a binary tree of depth 2, shouldn’t the leaves numbered within [0, 4)? What is the reason behind the design $[0; 2^{(self.max_depth+1)})$?
Related question: https://stackoverflow.com/questions/58585537/how-to-interpret-the-leaf-index-in-xgboost-tree
I think what you are seeing is the fact that all nodes in the tree are indexed because a priori the model doesn't know where splits will happen (i.e. any node could be a leaf). My guess is that the nodes follow an ordering similar to:
In your case all of the leaf nodes are at the max depth of the tree, so nodes 3-6 show up in your list. By contrast, if your data was all the same value I would expect all labels to be node 0 (because the split criteria was not met). And then you could have intermediate situations where after 1 split there is a node which does not meet the split criteria (in this case you could see either node 1 or 2 show up in your list).
Hope this helps!
Correct answer by Brandon Donehoo on July 29, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP