TransWikia.com

Hyperparameter tunning for Random Forest- choose the best max depth

Data Science Asked on June 17, 2021

I’m trying to choose the best parameters for random forest model.
For that goal I hae run my model in loop with only one parameter and each time I have changed the number for the parameter max depth. I have created two charts: one for the model score and one for the MAE.
this is how the charts look like:
enter image description here

enter image description here

My question is what is the right number to take for that parameter?
I fell likeI want to choose really low number when the training set and the test set are close to each other and high (the second point, when the accuracy is above 90) :
enter image description here

But I see that it gets stabalize much after and i’m afrid that it might lead to overfitting.

Is it true? should I choose bigger number from the stable zone or is ok to take low number as long as the training and test data are having the same accuracy and same error?

One Answer

In general, the max depth parameter should be kept at a low value in order to avoid overfitting: if the tree is deep it means that the model creates more rules at a more detailed level using fewer instances. Very often some of these rules are due to chance, i.e. they don't correspond to a real pattern in the data.

Overfitting is visible in your graphs from the quite high difference between training and test set performance. One can observe that the performance on the test set increases with the first few values of depth (I'm guessing until around 5 or so), and it starts decreasing after that. So the optimal point (performance and no overfitting) is the point that you mention, it's after this point that the model is overfit.

The stable part of the graph is probably due to another parameter (e.g. min. number of instance per leaf) which stops the model from overfitting even more. If the model was free to overfit as much as it wants, it would probably reach max performance on the training set and very low performance on the test set.

Note: I'm confused why you're using both accuracy and MAE, normally the former is for classification and the latter for regression.

Correct answer by Erwan on June 17, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP