Data Science Asked by theonionring0127 on May 29, 2021
In Scikit-learn’s random forest, you can set bootstrap=True and each tree would select a subset of samples to train on. Is there a way to see which samples are used in each tree?
I went through the documentation about the tree estimators and all the attributes of the trees that are made available by Scikit-learn, but none of them seems to provide what I’m looking for.
I don't think it is possible to get them directly, but we can exploit the random seed.
random_state : int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True)
This is from the scikit-learn source on GitHub:
def _generate_sample_indices(random_state, n_samples, n_samples_bootstrap):
    """
    Private function used to _parallel_build_trees function."""
    random_instance = check_random_state(random_state)
    sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)
    return sample_indices
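Since the draw depends only on the seed, feeding the same random_state through check_random_state (a public sklearn utility) reproduces the bootstrap indices exactly. A minimal sketch, with illustrative values for the seed and sample count:

```python
import numpy as np
from sklearn.utils import check_random_state

n_samples = 10  # illustrative training-set size
seed = 42       # illustrative seed

# Replaying the same draw from the same seed yields identical bootstrap indices
idx_a = check_random_state(seed).randint(0, n_samples, n_samples)
idx_b = check_random_state(seed).randint(0, n_samples, n_samples)
print(np.array_equal(idx_a, idx_b))  # True
```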
So we can reproduce these indices with custom code if we fix the seed, e.g. for 2 trees:
import numpy as np

num = 20            # number of training samples
np.random.seed(0)   # fix the seed
sample_1 = np.random.randint(0, num, num)
oob_1 = [elem for elem in np.arange(num) if elem not in sample_1]
sample_2 = np.random.randint(0, num, num)
oob_2 = [elem for elem in np.arange(num) if elem not in sample_2]
Please verify this with your own code; I have not tested it.
Answered by 10xAI on May 29, 2021
It is possible, actually. The approach is similar to @10xAI's, but it does not rely on the implicit ordering of the global random seed, which would break for trees trained in parallel. So the answer above may only work for trees that were not trained in parallel.
The actual working answer is simple: use the random state stored in each estimator to replay the bootstrap sampling.
So, for instance, assume rf is your trained random forest. Then it is easy to get both the sampled and unsampled indices by importing the appropriate (private) functions and replicating the sampling using the seed stored in each rf.estimators_[0].random_state. For example, to retrieve the lists of sampled and unsampled indices:
import sklearn.ensemble._forest as forest_utils

n_samples = len(Y)  # number of training samples
n_samples_bootstrap = forest_utils._get_n_samples_bootstrap(
    n_samples, rf.max_samples
)

unsampled_indices_trees = []
sampled_indices_trees = []

for estimator in rf.estimators_:
    unsampled_indices = forest_utils._generate_unsampled_indices(
        estimator.random_state, n_samples, n_samples_bootstrap)
    unsampled_indices_trees.append(unsampled_indices)

    sampled_indices = forest_utils._generate_sample_indices(
        estimator.random_state, n_samples, n_samples_bootstrap)
    sampled_indices_trees.append(sampled_indices)
Each estimator is a decision tree in this case, so you can use all of its methods to compute custom oob_scores and whatnot.
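For instance, a per-tree OOB accuracy can be sketched as follows. This relies on the same private sklearn helpers as above (so it may break across versions), and the dataset here is synthetic just to make the snippet self-contained:

```python
import sklearn.ensemble._forest as forest_utils
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration
X, Y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=5, bootstrap=True,
                            random_state=0).fit(X, Y)

n_samples = len(Y)
n_samples_bootstrap = forest_utils._get_n_samples_bootstrap(
    n_samples, rf.max_samples)

oob_accs = []
for tree in rf.estimators_:
    # Replay the bootstrap draw to find the samples this tree never saw
    oob_idx = forest_utils._generate_unsampled_indices(
        tree.random_state, n_samples, n_samples_bootstrap)
    # Score the tree only on its own out-of-bag samples
    oob_accs.append(tree.score(X[oob_idx], Y[oob_idx]))

print(oob_accs)
```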
Hope this helps!
Answered by pixelmitch on May 29, 2021