How can I improve the accuracy of my model? (Cab Cancellation Prediction)

Question

I am trying to predict whether or not a customer will cancel a booking, based on several parameters such as trip type, car type, source of booking, start time, lead time (start time minus booking time) and a few others. In the code below, default.ct, the first classification tree I fit, gives an accuracy of 75%. deeper.ct, the deeper tree I generate, gives an accuracy of 70%. As I prune progressively, the accuracy of the pruned tree stays about the same. Boosting with the adabag package takes far too long because I have nearly 500,000 observations across 19 variables. xgboost gives me the best mlogloss value, at about 0.43.

What can I do to improve the accuracy of the model?

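    # Load the packages used below
    library(rpart)       # rpart(), rpart.control(), printcp(), prune()
    library(rpart.plot)  # prp()
    library(caret)       # confusionMatrix()

    # Generate classification tree
    default.ct <- rpart(tag ~ ., data = train.df, method = "class",
                        control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.001))
    summary(default.ct)$used
    printcp(default.ct)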

    # Plot the tree, then generate the confusion matrix for the training data
    prp(default.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10)
    default.ct.point.pred.train <- predict(default.ct, train.df, type = "class")
    confusionMatrix(default.ct.point.pred.train, train.df$tag)

    # Grow a deeper tree with no complexity penalty
    deeper.ct <- rpart(tag ~ ., data = train.df, method = "class", cp = 0, minsplit = 1)
    # Count the number of leaves
    sum(deeper.ct$frame$var == "<leaf>")

    # Use cross-validation to prune the tree
    cv.ct <- rpart(tag ~ ., data = train.df, method = "class", cp = 0, minsplit = 5, xval = 5)
    # Use printcp() to print the cp table
    printcp(cv.ct)

    # Store the validation accuracy for each cp value in `acc` and print it out
    # (the name `acc` avoids masking the base function c())
    acc <- numeric(nrow(cv.ct$cptable))
    for (i in seq_len(nrow(cv.ct$cptable))) {
      pruned.ct <- prune(cv.ct, cp = cv.ct$cptable[i, "CP"])
      pruned.ct.point.pred.valid <- predict(pruned.ct, valid.df, type = "class")
      acc[i] <- confusionMatrix(pruned.ct.point.pred.valid, valid.df$tag)$overall["Accuracy"]
    }
    acc


    # Prune the tree at the second-largest cp, for use on the validation data
    pruned.ct <- prune(cv.ct, cp = cv.ct$cptable[2, "CP"])
    # Count the number of leaves
    sum(pruned.ct$frame$var == "<leaf>")

classification machine learning predictive modeling r

Rohan · Answer

Create additional variables, e.g. lead_time - start_time can serve as a time-to-book feature.
Reduce the number of levels in categorical variables that have many classes, if any are present (part of EDA).
Standardize numeric variables: (val - mean) / sigma.
A single tree is a very weak classifier; you will have to use bagging or boosting (e.g. AdaBoost, GBM, or random forest).
Try parameter tuning. I am not pasting any links since I don't know the policy on advertising here, but just search Google for "GBM parameter tuning" and you'll find multiple guides.
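
As a rough sketch of the second and third points, here is one way to lump rare factor levels and standardize a numeric column in base R. The data frame `bookings`, its columns `area` and `lead_time`, the helper names, and the `min_count = 50` threshold are all made up for illustration; they are not from the question's data.

```r
# Toy data standing in for the real bookings; all column names are hypothetical.
set.seed(42)
n <- 1000
bookings <- data.frame(
  area = factor(sample(c(paste0("A", 1:5), paste0("rare", 1:20)), n,
                       replace = TRUE,
                       prob = c(rep(0.18, 5), rep(0.005, 20)))),
  lead_time = rexp(n, rate = 1 / 24)  # hours between booking and trip start
)

# Collapse levels seen in fewer than `min_count` rows into a single "Other" level.
lump_rare_levels <- function(f, min_count = 50) {
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  f <- as.character(f)
  f[f %in% rare] <- "Other"
  factor(f)
}

# Standardize a numeric vector: (val - mean) / sigma.
standardize <- function(x) (x - mean(x)) / sd(x)

bookings$area_lumped <- lump_rare_levels(bookings$area)
bookings$lead_time_z <- standardize(bookings$lead_time)

nlevels(bookings$area)         # number of levels before lumping
nlevels(bookings$area_lumped)  # number of levels after lumping
```

On real data you would probably compute the level counts, mean and sigma on the training set only and reuse them on the validation set, so the two are transformed identically.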

You mentioned run-time issues when you tried ensemble methods. To work around that:

Take a subset of the data and try ensembling on that (once you have finalized the model, run it on the whole dataset).
If you are using XGBoost, you have the option of passing a previously trained model in as the starting point (the xgb_model argument of xgb.train() in the R package), so you can run the training in batches.
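
A minimal sketch of the subset idea, assuming a stratified 10% sample and hand-rolled bagging of rpart trees (a stand-in for whichever ensemble you end up using). The toy data frame `full.df`, its columns, the helper functions, and the choice of 25 trees are assumptions for illustration only.

```r
library(rpart)

set.seed(1)
# Toy data standing in for the ~500,000 bookings; column names are hypothetical.
n <- 5000
full.df <- data.frame(
  lead_time = rexp(n, rate = 1 / 24),
  online    = factor(sample(c("yes", "no"), n, replace = TRUE))
)
p_cancel <- plogis(-2 + 0.03 * full.df$lead_time + 0.8 * (full.df$online == "yes"))
full.df$tag <- factor(ifelse(runif(n) < p_cancel, "cancel", "keep"))

# Stratified subsample: row indices covering `frac` of each class of `y`.
stratified_idx <- function(y, frac) {
  per_class <- lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = ceiling(frac * length(rows)))]
  })
  unlist(per_class, use.names = FALSE)
}

idx      <- stratified_idx(full.df$tag, frac = 0.1)
small.df <- full.df[idx, ]   # subset used for fitting
rest.df  <- full.df[-idx, ]  # remaining rows, used here as a hold-out

# Bagging by hand: fit one tree per bootstrap resample of the subset.
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(b) {
  boot <- small.df[sample.int(nrow(small.df), replace = TRUE), ]
  rpart(tag ~ ., data = boot, method = "class")
})

# Predict with every tree and take the majority vote per row.
predict_bagged <- function(trees, newdata, class_levels) {
  votes <- sapply(trees, function(tree) {
    as.character(predict(tree, newdata, type = "class"))
  })
  winner <- apply(votes, 1, function(v) names(which.max(table(v))))
  factor(winner, levels = class_levels)
}

bagged.pred <- predict_bagged(trees, rest.df, levels(full.df$tag))
mean(bagged.pred == rest.df$tag)  # hold-out accuracy on the toy data
```

Sampling within each class keeps the cancel/keep ratio of the subset close to that of the full data, so model settings chosen on the subset are more likely to carry over when you refit on everything.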
