
Misclassification Rate for Random Forest Plateauing too Early

Data Science Asked by user58887 on June 8, 2021

Using R, I have created 5 random forest models with 5 different numbers of trees (3, 10, 30, 100, 300). My intention was to compute the misclassification rate of each model and plot the rates against the number of trees, to illustrate that, generally, increasing the number of trees in a random forest correlates with a decreasing misclassification rate.

I had a few colleagues run this same model in Python, and all of them reached a misclassification rate of ~0.08 with the 300-tree model. However, when I run my models in R, the misclassification rate levels out around ~0.2 at the 100-tree model and does not get any lower with the 300-tree model. I'm curious what may be causing this discrepancy. I've provided my code below.

madelon_train <- data.frame(madelon_train_data, madelon_train_labels)
for(i in c(3,10,30,100,300)){
    assign(paste("madelonforest", i, sep = ""),
           randomForest(as.factor(madelon_train$V1.1) ~ ., data = madelon_train,
                        ntree = i, mtry = sqrt(500), replace = FALSE))
}

modellist <- vector(mode="list", length=5)
for(i in c(3,10,30,100,300)){
    modellist[[i]] <- eval(as.name(paste("madelonforest", i, sep = "")))
}


#Use models to predict training data and compute misclassification error

classerrlisttrain <- vector(mode = "list", length = 5)
for(i in c(3,10,30,100,300)){
    err <- table(as.numeric(as.character(predict(modellist[[i]], madelon_train_data,
                                                 type = 'class', OOB = TRUE))) -
                 madelon_train_labels)
    classerrlisttrain[[i]] <- assign(paste("misclassification", i, sep = ""),
                                     err[names(err) == 0])
}

for(i in c(3,10,30,100,300)){
    classerrlisttrain[[i]] = as.double(classerrlisttrain[[i]])
    classerrlisttrain[[i]] = 1 - classerrlisttrain[[i]]/length(madelon_train_labels$V1)
}


#Use models to predict test data and compute misclassification error

classerrlisttest <- vector(mode = "list", length = 5)
for(i in c(3,10,30,100,300)){
    err <- table(as.numeric(as.character(predict(modellist[[i]], madelon_valid_data,
                                                 type = 'class'))) -
                 madelon_valid_labels)
    classerrlisttest[[i]] <- assign(paste("misclassification", i, sep = ""),
                                    err[names(err) == 0])
}

for(i in c(3,10,30,100,300)){
    classerrlisttest[[i]] = as.double(classerrlisttest[[i]])
    classerrlisttest[[i]] = 1 - classerrlisttest[[i]]/length(madelon_valid_labels$V1)
}


#Plot misclassification errors vs number of trees

plot(c(3,10,30,100,300), classerrlisttrain[c(3,10,30,100,300)], type = 'l',
     xlab = 'Number of Trees', ylab = 'Misclassification Rate',
     xlim = c(1,300), ylim = c(0,0.5), col = "red")
lines(c(3,10,30,100,300), classerrlisttest[c(3,10,30,100,300)], type = 'l',
      col = "blue")
legend(1, 0.1, legend = c("Train Data", "Test Data"),
       col = c("red","blue"), lty = 1, cex = 0.8)

One Answer

If you and your colleagues ran the same model on the same data, you should get the same results (give or take stochastic error). Did your colleagues use the same environment, the same packages, and the same versions?
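One concrete check is to fix the random seed and make every hyperparameter explicit, since the defaults of R's randomForest and scikit-learn's RandomForestClassifier differ: for example, scikit-learn bootstraps with replacement by default, while the question's code passes replace = FALSE. The sketch below uses synthetic stand-in data and illustrative settings, not the Madelon data, and only varies the sampling scheme:

```r
library(randomForest)
set.seed(42)  # fix the RNG so repeated runs are comparable

# Synthetic stand-in data (assumption: the point is the settings, not the data)
n <- 200; p <- 20
x <- data.frame(matrix(rnorm(n * p), n, p))
y <- factor(x[, 1] + rnorm(n) > 0)

# Question's setting: sampling without replacement
rf_r  <- randomForest(x, y, ntree = 300, mtry = floor(sqrt(p)), replace = FALSE)

# Setting closer to scikit-learn's RandomForestClassifier defaults:
# bootstrap sampling with replacement, max_features = "sqrt"
rf_py <- randomForest(x, y, ntree = 300, mtry = floor(sqrt(p)), replace = TRUE)

rf_r$err.rate[300, "OOB"]   # compare final OOB error under each setting
rf_py$err.rate[300, "OOB"]
```

If the two settings give visibly different errors on the real data, the sampling difference is one plausible source of the R-vs-Python gap; package versions and the rest of the parameters are worth the same explicit comparison.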

Also, it is known that building more trees gives better performance, so if possible you should build more rather than fewer: a random forest does not overfit as trees are added; instead, the error/accuracy stabilizes at some point. Where that point lies (the number of trees) varies from dataset to dataset, so you cannot really determine it beforehand.
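Because the error simply stabilizes, there is also no need to fit five separate forests: a single randomForest fit records the cumulative OOB error after each tree in its err.rate matrix, so the whole error-versus-trees curve comes from one model. A minimal sketch on synthetic stand-in data (the data and settings are assumptions for illustration):

```r
library(randomForest)
set.seed(1)

# Synthetic binary classification data standing in for Madelon
n <- 300; p <- 10
x <- data.frame(matrix(rnorm(n * p), n, p))
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, "a", "b"))

# One large forest; err.rate[, "OOB"] holds the running OOB error
# after 1, 2, ..., ntree trees
rf <- randomForest(x, y, ntree = 300)
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "Number of Trees", ylab = "OOB Misclassification Rate")
```

The curve typically drops quickly and then flattens; where it flattens is data-dependent, which is the point above.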

Answered by user2974951 on June 8, 2021
