
Would a random forest's performance be lower if I drop the first or the last tree?

Data Science Asked on December 22, 2020

Suppose I’ve trained a RandomForest model with 100 trees. I then have two cases:

  • I drop the first tree in the model.
  • I drop the last tree in the model.

Would the model performance be less in the first or the second case?

Since the last tree should be the best-trained one, I would say that the first scenario should be less performant than the second.

And what if I were using another model, like a gradient-boosted decision tree? I guess it would be the same.

I am okay with some math to prove it, or any other way that might prove it.

Update

I tried with two different learning rates, 0.01 and 8. With 0.01 I get:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from scipy.special import expit as sigmoid  # logistic sigmoid used by compute_loss below

# For convenience we will use sklearn's GBM; the situation will be similar with XGBoost and others
clf = GradientBoostingClassifier(n_estimators=5000, learning_rate=0.01, max_depth=3, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict_proba(X_test)[:, 1]
# "Test logloss: {}".format(log_loss(y_test, y_pred)) returns 0.003545821535500366

def compute_loss(y_true, scores_pred):
    '''
        Since we use raw scores we will wrap log_loss 
        and apply sigmoid to our predictions before computing log_loss itself
    '''
    return log_loss(y_true, sigmoid(scores_pred))
    

'''
    Get the cumulative sum of the *decision function* over the trees: the i-th element is the sum of trees 0...i-1.
    We cannot use staged_predict_proba, since we want to manipulate the raw scores
    (not probabilities), and only at the end convert the scores to probabilities using the sigmoid.
'''
cum_preds = np.array([x for x in clf.staged_decision_function(X_test)])[:, :, 0] 

print ("Logloss using all trees:           {}".format(compute_loss(y_test, cum_preds[-1, :])))
print ("Logloss using all trees but last:  {}".format(compute_loss(y_test, cum_preds[-2, :])))
print ("Logloss using all trees but first: {}".format(compute_loss(y_test, cum_preds[-1, :] - cum_preds[0, :])))

which gives:

Logloss using all trees:           0.003545821535500366
Logloss using all trees but last:  0.003545821535500366
Logloss using all trees but first: 0.0035335315747614293

Whereas with 8 I obtain:

clf = GradientBoostingClassifier(n_estimators=5000, learning_rate=8, max_depth=3, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict_proba(X_test)[:, 1]
# "Test logloss: {}".format(log_loss(y_test, y_pred)) returns 3.03310165292726e-06

cum_preds = np.array([x for x in clf.staged_decision_function(X_test)])[:, :, 0] 

print ("Logloss using all trees:           {}".format(compute_loss(y_test, cum_preds[-1, :])))
print ("Logloss using all trees but last:  {}".format(compute_loss(y_test, cum_preds[-2, :])))
print ("Logloss using all trees but first: {}".format(compute_loss(y_test, cum_preds[-1, :] - cum_preds[0, :])))

gives:

Logloss using all trees:           3.03310165292726e-06
Logloss using all trees but last:  2.846209929270204e-06
Logloss using all trees but first: 2.3463091271266125

3 Answers

The two slightly smaller models will perform exactly the same, on average. There is no difference baked into the different trees: "the last tree will be the best trained" is not true. The only differences among the trees are the random subsample each one works with and the random effects while building the tree (e.g., feature subsetting).

Gradient boosted trees are a different story. If you drop the first tree after you finish training, the resulting model will be mostly garbage. Every subsequent tree was trained to improve upon the fit of the previous trees, and removing any single tree will put all future trees out of context. (To give an extreme example, suppose the first tree actually captures "the correct" model. All future trees will just fit on the remaining noise.) On the other hand, removing the final tree is equivalent to having just trained one fewer tree, which may be good or bad depending on your bias-variance tradeoff at that point.
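As a quick sanity check, here is a minimal sketch of that symmetry (assuming a train/test split like the one in the question's update): average the per-tree probabilities of a fitted RandomForestClassifier, which is what sklearn does internally, and compare dropping the first tree against dropping the last one.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

# X_train, X_test, y_train, y_test are placeholders for your own data split
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Per-tree probability of the positive class: shape (n_trees, n_samples)
tree_probs = np.stack([t.predict_proba(X_test)[:, 1] for t in rf.estimators_])

clip = lambda p: np.clip(p, 1e-15, 1 - 1e-15)   # guard against log(0)
full     = tree_probs.mean(axis=0)              # all 100 trees
no_first = tree_probs[1:].mean(axis=0)          # drop the first tree
no_last  = tree_probs[:-1].mean(axis=0)         # drop the last tree

print("Logloss, all trees:    ", log_loss(y_test, clip(full)))
print("Logloss, without first:", log_loss(y_test, clip(no_first)))
print("Logloss, without last: ", log_loss(y_test, clip(no_last)))

Neither of the two 99-tree averages is systematically better or worse; any gap you see is just noise from the particular random subsamples those two trees happened to get.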

Correct answer by Ben Reiniger on December 22, 2020

In a random forest, each tree of the forest is trained independently of the others. There is no relation between trees.

To summarise very quickly: if you have a dataset with 10 attributes, each tree selects n of them (a parameter you have to fix) among the 10 and builds a basic decision tree (C4.5-style) using only those n attributes. Then, when you want to predict a new value, the input goes through all the trees of your forest, and the output predicted by the majority of the trees is returned.

So whether you remove the 1st, the k-th, or the 100th tree, the model will behave the same (and almost the same as if you removed nothing, since it just turns the model into a 99-tree forest instead of a 100-tree one).
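Here is a minimal sketch of that majority-vote view, assuming a binary 0/1 target and a hypothetical train/test split (note that sklearn's RandomForestClassifier actually averages class probabilities rather than hard-voting, but the conclusion is the same):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_test are placeholders; a binary 0/1 target is assumed
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Hard vote of every individual tree: shape (n_trees, n_samples)
votes = np.stack([tree.predict(X_test) for tree in rf.estimators_])

pred_all      = votes.mean(axis=0) > 0.5       # all 100 trees
pred_no_first = votes[1:].mean(axis=0) > 0.5   # drop the 1st tree
pred_no_last  = votes[:-1].mean(axis=0) > 0.5  # drop the 100th tree

# The three predictions agree on (almost) every test sample
print((pred_all == pred_no_first).mean(), (pred_all == pred_no_last).mean())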

Answered by BeamsAdept on December 22, 2020

In the case of Random Forest, a new tree is built without any input from the previously built trees. If the number of trees built is high, dropping any one tree when making a decision won't affect the final output of the random forest model unless the dropped tree holds information about an extreme outlier that impacts the ensemble model.

In the case of Boosting, the outputs of the trees are aggregated in the following fashion:

$f^1(x) = f^{0}(x)+\theta_1\phi_1(x)$

$f^2(x) = f^{0}(x)+\theta_1\phi_1(x) + \theta_2\phi_2(x) = f^{1}(x)+\theta_2\phi_2(x)$

$f^3(x) = f^{0}(x)+\theta_1\phi_1(x) + \theta_2\phi_2(x) + \theta_3\phi_3(x) = f^{2}(x)+\theta_3\phi_3(x)$

...

$f^n(x) = f^{(n-1)}(x)+\theta_n\phi_n(x)$

where $f^0(x)$ is an initial guess, $f^i(x)$ is the function learnt by the ensemble with $i$ trees, $\phi_i(x)$ is the $i$-th tree, $\theta_i$ is the weight associated with the $i$-th tree, and tree $\phi_i$ is learnt based on the error made by $f^{i-1}(x)$.

How the tree removal impacts the ensemble model depends on the function $f(x)$ you are trying to learn. Here are 2 simplified scenarios:

  1. If $f(x)$ is simple enough that $f^1(x)$ already captures it from the first tree $\phi_1(x)$, the subsequent trees will add little value to the ensemble. In that case, you may not see any noticeable drop in performance if you drop the final trees.

  2. If $f(x)$ is complex, then dropping the first tree $\phi_1(x)$ from the ensemble will noticeably impact its performance. In this setting, when $n$ is large, the $n$-th tree may add little value to the ensemble.
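To make the additive structure above concrete, here is a minimal hand-rolled boosting sketch for squared error on made-up toy data (the data, the constant learning rate and the tree depth are illustrative choices, not anything from the question). Each tree is fit to the residuals of the current ensemble, so the individual contributions $\theta_i\phi_i(x)$ can be removed afterwards to mimic dropping trees.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)   # toy regression target

theta = 0.5                        # learning rate, the theta_i above (kept constant here)
f0 = np.full_like(y, y.mean())     # initial guess f^0(x)
contribs = []                      # theta_i * phi_i(x), one entry per tree
current = f0.copy()
for i in range(100):
    phi = DecisionTreeRegressor(max_depth=2).fit(X, y - current)  # fit the residuals of f^{i-1}
    contribs.append(theta * phi.predict(X))
    current += contribs[-1]

contribs = np.array(contribs)
mse = lambda pred: np.mean((y - pred) ** 2)
print("full ensemble:  ", mse(f0 + contribs.sum(axis=0)))
print("drop last tree: ", mse(f0 + contribs[:-1].sum(axis=0)))
print("drop first tree:", mse(f0 + contribs[1:].sum(axis=0)))   # later trees are now 'out of context'

Dropping the first contribution typically hurts far more than dropping the last one, mirroring the log-loss numbers in the question's update.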

Answered by cmn on December 22, 2020
