Compare cross validation and test set results

Data Science · Asked by Rayyan Abid Ali on January 1, 2021

I am having a hard time understanding the difference between my cross-validation results and the results on the test set.

First I made the following pipeline:

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('clf', DecisionTreeClassifier(random_state=0))])

Then I use cross-validation on the scaled training set (75% of the original dataset):

>>> cross_val_score(pipe, X_train_scaled,Y_train,cv=7).mean()
0.7257796129913106

I then fit the pipeline on the training data and score the classifier on the same training data:
>>> pipe.fit(X_train_scaled,Y_train)
>>> pipe.score(X_train_scaled,Y_train)
0.7734339749126584

Finally, I checked the model's performance on the test set:
>>> pipe.score(X_test_scaled, Y_test)
0.941353836876225

Question 1: Have I done the right steps? Do I even need to score the pipeline on the training data to get the training score?

Question 2: Why is the score on the test data so much higher than the cross-validated one? Is the model underfitting, or is it okay for this to happen?

One Answer

At first sight the steps seem to be correct; nevertheless:

  • you did not tell how you split your dataset into training and test sets, in case that has some influence on your final score values (see the sketch under step 1 below)
  • you might be interested in sklearn's GridSearchCV, with which you carry out cross-validation similar to what you did, but without having to manually refit the model on the whole train dataset: via the refit parameter you choose the scoring metric used to select the best model from the grid search, and that best model is automatically refit on the whole train dataset without splits.

Basically, the steps for your example could be:
  1. split your dataset to leave some rows out for a final validation
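
    A minimal sketch of this step, assuming the full features and labels are named X and y (these names are assumptions, not from the question):

    from sklearn.model_selection import train_test_split

    # hold out 25% of the rows for a final validation set; random_state (and
    # whether you stratify by class) decides which rows land in each set,
    # which is one way the split itself can influence the final scores
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                      stratify=y, random_state=0)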

  2. apply grid search cross-validation on the rest of the data (your train set), something like:

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    dt_clf = DecisionTreeClassifier(random_state=0, class_weight="balanced")
    hyperparams_search_space = {'criterion': ["gini", "entropy"], 'max_depth': [4, 5, 6, 8], 'min_samples_leaf': [2, 3, 5]}
    dec_tree_cross_val_clf = GridSearchCV(dt_clf, hyperparams_search_space, cv=10, scoring=['accuracy', 'recall', 'precision', 'roc_auc'], refit='recall', return_train_score=True, n_jobs=-1)
    dec_tree_cross_val_clf.fit(X_train, y_train)

    where for each scoring metric, you would have train and test values across your k folds in a dataframe via dec_tree_cross_val_clf.cv_results_.
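
    For instance, to inspect those per-fold metrics (a sketch; the pandas usage is an assumption here, not part of the original answer):

    import pandas as pd

    # cv_results_ is a dict of arrays with one row per hyperparameter combination,
    # holding per-fold and mean/std train and test scores for every metric in `scoring`
    cv_results_df = pd.DataFrame(dec_tree_cross_val_clf.cv_results_)
    print(cv_results_df[['params', 'mean_test_recall', 'mean_train_recall']])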

  3. use dec_tree_cross_val_clf.best_estimator_ to make predictions on your validation set and calculate a final score metric:

    dec_tree_cross_val_clf_best_est = dec_tree_cross_val_clf.best_estimator_
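
    and then, for example (assuming X_val and y_val are the rows held out in step 1):

    from sklearn.metrics import recall_score

    # best_estimator_ was already refit on the whole train set (refit='recall'),
    # so it can score the held-out data directly
    y_val_pred = dec_tree_cross_val_clf_best_est.predict(X_val)
    print(recall_score(y_val, y_val_pred))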

Correct answer by German C M on January 1, 2021
