Is fitting two RandomForestClassifiers with 500 trees each and averaging their predicted probabilities on the test set more performant than fitting one with 1000 trees?

Question

If I fit two RandomForestClassifiers with 500 trees each and average their predicted probabilities on the test set, would I get better results than by fitting a single RandomForestClassifier with 1000 trees and using it to get the test set probabilities?
Since these algorithms are randomized, I would expect their performance to be roughly the same.
I am okay with some math to prove it, or any other way that might prove it.

Julio Jesus · Answer

With a toy dataset, I obtained slightly better results with the single 1000-tree RF:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# The seaborn styles were renamed in newer matplotlib releases; use whichever name is available.
style = "seaborn-v0_8-whitegrid"
if style not in plt.style.available:
    style = "seaborn-whitegrid"
plt.style.use(style)

n_models = 2

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

# One forest with 1000 trees
rForest1000 = RandomForestClassifier(n_estimators=1000, random_state=42).fit(X_train, y_train)
preds1000 = rForest1000.predict_proba(X_test)[:, 1]

roc_score = roc_auc_score(y_true=y_test, y_score=preds1000)
print(f"Test set score for 1K trees is : {round(roc_score, 4)}")

# Two forests with 500 trees each (different seeds), predicted probabilities averaged
ls = []
for i in range(n_models):
    model = RandomForestClassifier(n_estimators=500, random_state=i).fit(X_train, y_train)
    ls.append(model.predict_proba(X_test)[:, 1])
preds = np.mean(ls, axis=0)

roc_score = roc_auc_score(y_true=y_test, y_score=preds)
print(f"Test set score for 500 trees X 2 avg is : {round(roc_score, 4)}")

# KS statistic between the predicted probabilities of the two classes
ks_1000 = round(ks_2samp(preds1000[y_test == 0], preds1000[y_test == 1]).statistic, 3)
ks_avg = round(ks_2samp(preds[y_test == 0], preds[y_test == 1]).statistic, 3)

fig, ax = plt.subplots(1, 2, figsize=(12, 5))

ax[0].hist(preds1000[y_test == 0], color="darkgreen", alpha=0.5)
ax[0].hist(preds1000[y_test == 1], color="darkred", alpha=0.5)
ax[0].set_title(f"Predictions distribution with 1K Trees RF\nKS: {ks_1000}")

ax[1].hist(preds[y_test == 0], color="darkgreen", alpha=0.5)
ax[1].hist(preds[y_test == 1], color="darkred", alpha=0.5)
ax[1].set_title(f"Predictions distribution with 2 X 500 Trees RF\nKS: {ks_avg}")

plt.show()

user419164 · Answer

Short answer: they are equivalent.
Any result that suggests otherwise is due to random chance or to changes in parameters other than the number of trees. A random forest is just a voted ensemble of decision trees. By default, each tree's vote is weighted equally and the votes are then averaged. Suppose sets X and Y are the same size. If you take the average of X and the average of Y, and then average those two numbers, you get exactly the same value as pooling X and Y and averaging everything at once. Two random forests with the same number of trees are in this same situation: each forest's prediction is the average over its own trees, so averaging the two forests is the same as averaging over all of the trees together.
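The averaging argument can be checked numerically. This is only a minimal sketch: the per-tree probabilities are randomly generated stand-ins for a single test sample, not the output of real trees, and the names `forest_a` and `forest_b` are made up for the illustration.

```python
import random
from statistics import fmean

random.seed(0)

# Hypothetical per-tree probabilities for one test sample
# (random stand-ins for the outputs of 1000 independently grown trees).
tree_probs = [random.random() for _ in range(1000)]

forest_a = tree_probs[:500]   # plays the role of the first 500-tree forest
forest_b = tree_probs[500:]   # plays the role of the second 500-tree forest

# Average of the two forest-level averages
avg_of_two_forests = (fmean(forest_a) + fmean(forest_b)) / 2

# Average over all 1000 trees at once
avg_of_one_forest = fmean(tree_probs)

# The two values agree up to floating-point rounding.
difference = abs(avg_of_two_forests - avg_of_one_forest)
print(difference < 1e-12)
```

This shows only that the two ways of combining the same trees coincide. Two separately fitted runs still contain different random trees, which is where the small differences in scores come from.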
Note, however, that if the two forests have different numbers of trees and you average their predictions with equal weight, each tree in the smaller forest will have its vote weighted more heavily than each tree in the larger forest.
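To make the weighting concrete, here is a small sketch with hypothetical forest sizes of 100 and 900 trees. The helper names `per_tree_weight` and `pooled_average` are invented for this example; weighting each forest by its tree count, as the second helper does, is one way to get back equal per-tree weights.

```python
def per_tree_weight(n_trees_in_forest, n_forests=2):
    """Weight of a single tree when the forest-level averages are averaged equally."""
    return 1 / (n_forests * n_trees_in_forest)


def pooled_average(mean_a, n_a, mean_b, n_b):
    """Combine two forest-level averages so that every tree counts equally."""
    return (n_a * mean_a + n_b * mean_b) / (n_a + n_b)


small, large = 100, 900            # hypothetical forest sizes

w_small = per_tree_weight(small)   # each tree in the small forest: 1/200
w_large = per_tree_weight(large)   # each tree in the large forest: 1/1800

# How much more one tree of the small forest counts than one of the large forest
ratio = w_small / w_large
print(round(ratio, 6))

# Hypothetical forest-level averages for one test sample
plain = (0.2 + 0.8) / 2                        # equal weight per forest
pooled = pooled_average(0.2, small, 0.8, large)  # equal weight per tree
print(round(plain, 6), round(pooled, 6))
```

With equal forest sizes the two combinations coincide, which is the case asked about in the question.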
