
Random Forest - Explanation Parameter

Data Science Asked by user43348044 on June 12, 2021

I have some questions about the "standard" parameters of a random forest. Below I describe my understanding of these parameters; I would be glad if you could confirm or correct it. 🙂
For your information, I’m using scikit-learn.

max_depth: That's the maximum depth of a tree. Limiting it is good for runtime, and it can also help model performance by reducing overfitting.

max_features: With this parameter I specify the number of features with which each tree is built.
My question: if the parameter is set (for example to 0.75), does each tree include 75% of all features, with every tree built on a different 75% of the features?

min_samples_leaf: The minimal number of samples per leaf. Is this an important parameter, and why?

n_estimators: The number of trees. Is there a good default value for this parameter? Are more trees usually better here?

Are these the most important parameters or have I forgotten some?
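For reference, here is a minimal sketch of how these parameters appear in a scikit-learn call (the dataset and values are illustrative assumptions, not recommendations):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Toy data, just to make the sketch runnable.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=500,     # number of trees in the forest
        max_depth=None,       # None = no depth limit, trees grow fully
        max_features=0.75,    # fraction of features considered as split candidates
        min_samples_leaf=1,   # minimum number of samples required in each leaf
        random_state=0,
    )
    print(cross_val_score(rf, X, y, cv=5).mean())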

2 Answers

You need to be very careful with your assumptions about "important" parameters. All the parameters have their function; that is precisely why they were implemented in the first place. If by importance you mean the potential to affect your outputs or efficiency, then some are probably more relevant than others, but most likely this relevance will also depend on the nature of your task and your data.

Having said that, I recommend you navigate through the scikit-learn documentation, which is usually helpful and contains a description of all these parameters.

Moreover, they usually include nice and simple usage examples that can be helpful: here and here.

Most importantly, practice and practice; you will start to get a feeling for the importance of each parameter and its relevance to the task at hand.

Finally, there are hundreds of blogs with nice tutorials that will give you a little bit more of insights about parameter tuning and "importance" ;)

You can try this one, for example.

Good luck!

Answered by TitoOrt on June 12, 2021

max_features selects the number of features per split, not per tree. At every split, the algorithm randomly selects x% of your columns and, among this subset, picks the feature whose split yields the most information gain/lowest loss.
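A sketch of this behaviour in scikit-learn (the toy dataset is an assumption for illustration): even with a small max_features, a single tree typically ends up using many different features overall, because a fresh candidate subset is drawn at every split.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, n_informative=8,
                               random_state=0)

    # Only 2 of the 10 features are split candidates at any single split...
    rf = RandomForestClassifier(n_estimators=10, max_features=2,
                                random_state=0).fit(X, y)

    # ...yet the first tree still uses many different features across all
    # of its splits (leaf nodes are marked with -2 and are filtered out).
    tree = rf.estimators_[0].tree_
    print(np.unique(tree.feature[tree.feature >= 0]))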

That being said, max_features and max_depth are generally sufficient to control overfitting. (Even tuning max_depth goes slightly against the original idea of the algorithm, which was to fit a bunch of overfit, high-variance trees and then reduce the variance by averaging, while mitigating correlation between trees by randomly selecting features at every split.) Set n_estimators as large as computationally feasible so that the averaging reduces variance as much as possible.
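One way to check this on your own data (a sketch, assuming a classification task; make_classification is only a stand-in) is to grow the forest incrementally with warm_start and watch the out-of-bag error, which typically flattens out as trees are added:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # warm_start=True reuses the already-fitted trees and only adds new ones.
    rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
    for n in (25, 50, 100, 200, 400):
        rf.set_params(n_estimators=n)
        rf.fit(X, y)
        print(f"n_estimators={n:4d}  OOB error={1 - rf.oob_score_:.4f}")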

However, there are no hard rules. I'd recommend tuning max_features first, then max_depth, and the other min/max_xyz hyperparameters only if results are still not satisfactory; a sketch of this staged approach follows below. In general, random forests should not require large amounts of tuning to be decent.
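As a sketch of that staged tuning (the grids, dataset, and values are assumptions for illustration, not recommendations):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Stage 1: tune max_features with everything else at its default.
    stage1 = GridSearchCV(
        RandomForestClassifier(n_estimators=300, random_state=0),
        param_grid={"max_features": ["sqrt", 0.25, 0.5, 0.75]},
        cv=5,
    ).fit(X, y)
    best_mf = stage1.best_params_["max_features"]

    # Stage 2: tune max_depth with the chosen max_features fixed.
    stage2 = GridSearchCV(
        RandomForestClassifier(n_estimators=300, max_features=best_mf,
                               random_state=0),
        param_grid={"max_depth": [None, 5, 10, 20]},
        cv=5,
    ).fit(X, y)
    print(best_mf, stage2.best_params_, round(stage2.best_score_, 4))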

Answered by aranglol on June 12, 2021

