Data Science Asked on December 9, 2021
I have an imbalanced data set where positives are just 10% of the whole sample. I am using logistic regression and random forest for classification. While comparing the results of these models, I have found that the probability output of the logistic regression spans the full [0, 1] range, while that of the random forest stays within [0, 0.6].
I cannot share the data set, but my doubt is about how these algorithms work: why does the random forest never produce a probability above 0.6?
For a random forest to output a probability of 1, every tree has to place the sample in a leaf containing only positive samples, because the forest's probability is the average of the per-tree leaf class fractions. Since that never happens here, either your features do not explain the variance of the output well or your model is under-fitted.
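A minimal sketch of that averaging behaviour, assuming the scikit-learn implementation and a synthetic imbalanced data set (the 90/10 class weights are illustrative, not your data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced problem: ~10% positives, purely for illustration.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The forest's predict_proba is the mean of the per-tree class probabilities,
# i.e. the leaf class fractions of each tree.
per_tree = np.stack([tree.predict_proba(X) for tree in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.predict_proba(X)))  # True

# So the maximum positive-class probability is capped by how "pure" the
# positive leaves are across trees.
print("max P(y=1):", rf.predict_proba(X)[:, 1].max())
```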
I suggest optimizing the hyper-parameters of your RF with cross-validation, and applying some oversampling to reduce the class imbalance in your data set, as sketched below.
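A hedged sketch of that suggestion, again assuming scikit-learn; the parameter grid, the naive random oversampling, and the ROC-AUC scoring are illustrative choices, not a prescription:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Naive random oversampling: repeat minority-class rows until both classes
# are the same size (a library such as imbalanced-learn could do this too).
pos = np.where(y_train == 1)[0]
neg = np.where(y_train == 0)[0]
extra = np.random.default_rng(0).choice(pos, size=len(neg) - len(pos))
X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])

# Cross-validated hyper-parameter search for the forest.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [None, 5, 10], "min_samples_leaf": [1, 5, 20]},
    scoring="roc_auc",
    cv=5)
grid.fit(X_bal, y_bal)

print(grid.best_params_)
print("max P(y=1):", grid.best_estimator_.predict_proba(X_test)[:, 1].max())
```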
Answered by mirimo on December 9, 2021