Data Science Asked on January 17, 2021
I'm building a sentiment classification model using RandomForestClassifier. I got a training accuracy of 99.65 and a cross-validation (RepeatedStratifiedKFold, 5 folds) accuracy of 97.29. I used F1 score as the metric. The dataset has 5184 samples. The dataset is imbalanced, so I'm setting the class_weight hyperparameter to 'balanced'. I have also done hyperparameter tuning. These are the parameters I tuned:
estimator = RandomForestClassifier(random_state=42, class_weight='balanced', n_estimators=850, min_samples_split=4, max_depth=None, min_samples_leaf=1, max_features='sqrt')
I think the model is overfitting.
I'm also wondering whether this issue is caused by the class imbalance.
Any immediate help on this is much appreciated.
There are quite a lot of features for the number of instances, so it's indeed likely that some overfitting is happening.
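One way to put a number on the gap between training and cross-validation performance is to ask scikit-learn for both scores in the same run. The snippet below is only a minimal sketch: the data comes from make_classification as a synthetic, imbalanced stand-in for the real sentiment features, and the sample count, feature count and n_estimators are assumptions chosen to keep it fast, not the values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the real data (assumption, not the asker's dataset).
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=8,
    weights=[0.85, 0.15], random_state=42,
)

# Unrestricted-depth forest with balanced class weights, as in the question,
# but with far fewer trees so the example runs quickly.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)

# 5 folds repeated twice gives 10 train/test splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

# return_train_score=True reports the F1 score on the training folds as well.
scores = cross_validate(clf, X, y, cv=cv, scoring="f1", return_train_score=True)

train_f1 = scores["train_score"].mean()
cv_f1 = scores["test_score"].mean()
print(f"mean train F1: {train_f1:.3f}")
print(f"mean CV F1:    {cv_f1:.3f}")
print(f"gap:           {train_f1 - cv_f1:.3f}")
```

A large gap between the two means suggests the forest is memorising the training folds; a small one suggests the high training score is less of a concern on its own.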
I'd suggest these options:
Set the max_depth parameter to a low value, maybe around 3 or 4. Run the experiment with a range of values (e.g. from 3 to 10) and observe the changes in performance (preferably use a validation set, so that when the best parameter is found you can do the final evaluation on a different test set).
Correct answer by Erwan on January 17, 2021
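The max_depth sweep suggested above could be sketched roughly like this. It is an illustration under stated assumptions, not the answerer's code: the data is synthetic (make_classification), the 60/20/20 train/validation/test split and n_estimators=100 are arbitrary choices, and F1 is used only because the question mentions it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=40, n_informative=8,
    weights=[0.8, 0.2], random_state=42,
)

# Hold out a test set first, then carve a validation set out of the rest
# (roughly 60% train, 20% validation, 20% test).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

# Try max_depth from 3 to 10 and record (train F1, validation F1) for each value.
results = {}
for depth in range(3, 11):
    model = RandomForestClassifier(
        n_estimators=100, max_depth=depth,
        class_weight="balanced", random_state=42,
    )
    model.fit(X_train, y_train)
    train_f1 = f1_score(y_train, model.predict(X_train))
    val_f1 = f1_score(y_val, model.predict(X_val))
    results[depth] = (train_f1, val_f1)
    print(f"max_depth={depth:2d}  train F1={train_f1:.3f}  val F1={val_f1:.3f}")

# Pick the depth with the best validation score, refit on train + validation,
# and evaluate once on the untouched test set.
best_depth = max(results, key=lambda d: results[d][1])
final_model = RandomForestClassifier(
    n_estimators=100, max_depth=best_depth,
    class_weight="balanced", random_state=42,
).fit(X_trainval, y_trainval)
test_f1 = f1_score(y_test, final_model.predict(X_test))
print(f"best max_depth={best_depth}  test F1={test_f1:.3f}")
```

Watching how the train and validation columns move apart as max_depth grows is the point of the exercise; the test score is only looked at once, after the depth has been fixed.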