Data Science: Asked on November 10, 2021
I’m working on a supervised learning problem, trying to predict a binary label using a Random Forest. I’m trying to tune my hyperparameters to get the best model for my data.
I can do this with GridSearchCV(), but is this the correct thing to do with a random forest? If I’m using GridSearchCV(), the training set and testing set change with each fold. From my understanding, if we set oob_score=True in RandomForestClassifier(), we are already evaluating on the out-of-bag samples (so CV is, in a sense, already built into RF).
What is the convention for hyperparameter tuning with Random Forest to get the best OOB score in sklearn? Can I just loop through a set of parameters and fit on the same training and testing set? Can I use GridSearchCV(), or does that make no sense with RF?
I can do this with GridSearchCV(), but is this correct to do with a random forest?
Yes, this is perfectly valid. It ignores the oob-score feature of random forests, but that isn't necessarily a bad thing. See e.g. https://stats.stackexchange.com/a/462720/232706
What is the convention to hyper-parameter tune with Random Forest to get the best OOB score in sklearn? Can I just loop through a set of parameters and fit on the same training and testing set?
I believe this would be the standard way of tuning using oob score, except that there is no testing set in this case. (You'll probably want a test set for future performance estimation of the final selected model though: selecting hyperparameters based on those oob scores means they are no longer unbiased estimates of future performance, just as in k-fold cross-validation! Your hyperparameter-candidate models shouldn't see that test set.)
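For concreteness, here is a minimal sketch of that loop (X_train and y_train are assumed to already hold your training data, and the grid values are just illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

# Illustrative grid; X_train / y_train are assumed to exist.
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10]}

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    # oob_score=True makes the forest score itself on each tree's
    # out-of-bag samples, giving a built-in validation estimate.
    rf = RandomForestClassifier(oob_score=True, random_state=0, **params)
    rf.fit(X_train, y_train)
    if rf.oob_score_ > best_score:
        best_score, best_params = rf.oob_score_, params

print(best_params, best_score)
```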
To take advantage of the various conveniences of the hyperparameter searches in sklearn (parallelization, saved results, refitted best model, etc.), you can hack it as in my answer to another question:
https://datascience.stackexchange.com/a/66238/55122
You can't directly use the oob score in a GridSearchCV, because that's coded to apply your scoring function to the test fold in each split.
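Roughly, the workaround looks like this (a sketch of the idea, not the exact code from the linked answer): give GridSearchCV a single "split" whose train and test indices are both the full training set, and a scorer that ignores the held-out fold and reports the fitted estimator's oob_score_ instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def oob_scorer(estimator, X, y):
    # Ignore the passed-in fold; report the OOB estimate instead.
    return estimator.oob_score_

indices = np.arange(len(X_train))  # X_train / y_train assumed to exist
search = GridSearchCV(
    RandomForestClassifier(oob_score=True, random_state=0),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 10]},
    scoring=oob_scorer,
    cv=[(indices, indices)],  # one "split": train on everything
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```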
Answered by Ben Reiniger on November 10, 2021
You can very well use GridSearchCV to fine-tune a Random Forest.
I do not understand what you mean by "If I'm using GridSearchCV(), the training set and testing set change with each fold."
Generally we apply GridSearchCV to the training set after we do the train/test split. The cross-validation splits the training data into multiple train/validation splits based on the k-fold value that you give. For example, if k is 10, the training data is split into 10 folds, where 1 fold is used for validation and the other 9 together are used for training. This repeats until all 10 folds have been used for validation, so you get 10 accuracy scores. In addition, in GridSearchCV we pass a set of hyperparameters based on the model we are using. This helps in finding the best hyperparameters for the model, to get the best accuracy score and also to avoid overfitting.
On the other hand, the OOB samples are the data that each tree in the random forest did not see during its training, so they serve as built-in unseen data.
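For example, a minimal version of that workflow (assuming X and y are your features and binary labels, and using illustrative grid values):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# X / y are assumed to exist; values in the grid are illustrative.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 500], "max_depth": [None, 10]},
    cv=10,  # 10 folds: 9 used for training, 1 for validation, rotated
)
search.fit(X_train, y_train)         # tuning sees only the training data
print(search.best_params_)
print(search.score(X_test, y_test))  # held-out test set for the final model
```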
Let me know if you need more detail. Hope this is helpful!
Answered by kappil c on November 10, 2021
You can definitely use GridSearchCV with Random Forest. In fact, you should use GridSearchCV to find the parameters that make your oob_score as high as possible.
Some parameters to tune are:
n_estimators: the number of trees your random forest should have. The more trees, the less overfitting. You should try values in the 100 to 5000 range.
max_depth: the maximum depth of each tree. You should set a max_depth so that your model doesn't memorise the training examples.
min_samples_split: the minimum number of samples a node must contain before it can be split into new nodes.
and many more...
But these are the main hyperparameters we tune to get a forest that works well, and to find them you should use GridSearchCV.
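As a sketch, a grid over the parameters above might look like this (the values are illustrative, and X_train / y_train are assumed to exist):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 500, 1000],  # more trees, more stable forest
    "max_depth": [5, 10, None],        # cap depth to curb memorisation
    "min_samples_split": [2, 5, 10],   # samples required to split a node
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```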
Answered by SrJ on November 10, 2021