
How to identify Overfitting in RandomForestClassifier?

Data Science Asked on January 17, 2021

I'm building a sentiment classification model using RandomForestClassifier. I get a training accuracy of 99.65% and a cross-validation accuracy (RepeatedStratifiedKFold, 5 folds) of 97.29%, using the F1 score as the metric. The dataset has 5184 samples and is imbalanced, so I'm setting the class_weight hyperparameter to 'balanced'. I have also done hyperparameter tuning. These are the parameters I tuned:

estimator = RandomForestClassifier(random_state=42, class_weight='balanced', n_estimators=850, min_samples_split=4, max_depth=None, min_samples_leaf=1, max_features='sqrt')
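The gap described above (training F1 well above cross-validated F1) can be measured directly. A minimal sketch, using synthetic data in place of the actual sentiment features (which aren't shown in the question) and fewer trees than the original 850 to keep the run fast:

```python
# Sketch: quantify the train/CV gap for the forest described above.
# X and y are synthetic stand-ins for the real features and labels;
# n_estimators is reduced from 850 purely for speed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5184, n_features=100,
                           weights=[0.8, 0.2], random_state=42)

clf = RandomForestClassifier(random_state=42, class_weight='balanced',
                             n_estimators=100, min_samples_split=4,
                             max_depth=None, min_samples_leaf=1,
                             max_features='sqrt')

# Cross-validated F1, mirroring RepeatedStratifiedKFold with 5 folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
cv_f1 = cross_val_score(clf, X, y, cv=cv, scoring='f1').mean()

# F1 on the data the model was trained on
clf.fit(X, y)
train_f1 = f1_score(y, clf.predict(X))

print(f"train F1 = {train_f1:.3f}, CV F1 = {cv_f1:.3f}, "
      f"gap = {train_f1 - cv_f1:.3f}")
```

A large positive gap (training score far above the cross-validated score) is the usual signature of overfitting; with max_depth=None the trees grow until the leaves are nearly pure, so the training F1 tends to sit close to 1.0.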

I'm thinking the model is overfitting.
I'm also wondering whether this issue is caused by the class imbalance.

Any immediate help on this is much appreciated.

One Answer

There are quite a lot of features for the number of instances, so it's indeed likely that some overfitting is happening.

I'd suggest these options:

  • Forcing the decision trees to be less complex by setting the max_depth parameter to a low value, maybe around 3 or 4. Run the experiment with a range of values (e.g. from 3 to 10) and observe the changes in performance (preferably use a validation set, so that when the best parameter is found you can do the final evaluation on a different test set).
  • Reducing the number of features: remove rare words (i.e. those which appear less than $N$ times) and/or use some feature selection method.
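The first suggestion above can be sketched as a depth sweep: hold out a validation set, fit the forest at each candidate max_depth, and watch the gap between training and validation F1 shrink as the trees get shallower. The data here is synthetic and the tree count is reduced for speed; both are stand-ins for the asker's actual setup.

```python
# Sketch of the suggested max_depth sweep (values 3..10 plus unlimited).
# Synthetic data stands in for the real sentiment features; swap in the
# actual X and y, and use a separate test set for the final evaluation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5184, n_features=100,
                           weights=[0.8, 0.2], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

results = {}
for depth in [3, 4, 6, 8, 10, None]:
    clf = RandomForestClassifier(n_estimators=100,  # fewer trees for speed
                                 max_depth=depth,
                                 class_weight='balanced',
                                 max_features='sqrt',
                                 random_state=42)
    clf.fit(X_tr, y_tr)
    train_f1 = f1_score(y_tr, clf.predict(X_tr))
    val_f1 = f1_score(y_val, clf.predict(X_val))
    results[depth] = (train_f1, val_f1)
    print(f"max_depth={depth}: train F1={train_f1:.3f}, "
          f"val F1={val_f1:.3f}, gap={train_f1 - val_f1:.3f}")
```

Typically the unlimited-depth forest shows the widest train/validation gap, while shallow trees trade a little validation score for much better agreement between the two. Pick the depth with the best validation F1, then confirm on a held-out test set.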

Correct answer by Erwan on January 17, 2021
