Compare cross validation and test set results

Data Science · Asked by Rayyan Abid Ali on January 1, 2021

I am having a hard time understanding the difference between my cross-validation results and the results on the test set.

First I made the following pipeline:

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('clf', DecisionTreeClassifier(random_state=0))])

Then I use cross-validation on the scaled training set (75% of the original dataset):

>>> cross_val_score(pipe, X_train_scaled,Y_train,cv=7).mean()
0.7257796129913106

I then fit the pipeline on the training data and score the classifier on the same training data:
>>> pipe.fit(X_train_scaled,Y_train)
>>> pipe.score(X_train_scaled,Y_train)
0.7734339749126584

Finally, I checked the model's performance on the test set:
>>> pipe.score(X_test_scaled, Y_test)
0.941353836876225

Question 1: Have I done the right steps? Do I even need to score the pipeline on the training data to get the training score?

Question 2: Why is the score on the test data so much higher than the cross-validated one? Is the model underfitting, or is it okay for this to happen?

One Answer

At first sight the steps seem to be correct; nevertheless:

  • you did not tell how you split your dataset into training and test sets, in case that has some influence on your final score values (see the sketch under step 1 below)
  • you might be interested in sklearn's GridSearchCV, with which you carry out cross-validation similar to what you did, but without having to manually refit the model on the whole train dataset: via the refit parameter you choose the scoring metric used to select the best model from the grid search, and that best model is automatically refit on the whole train dataset without splits.

Basically, the steps for your example could be:
  1. split your dataset to leave some rows out for a final validation
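
    A minimal sketch of this step, assuming the full features and labels are named X and y (these names are assumptions, not from the question):

    from sklearn.model_selection import train_test_split

    # hold out 25% of the rows for a final validation set; random_state (and
    # whether you stratify by class) decides which rows land in each set,
    # which is one way the split itself can influence the final scores
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                      stratify=y, random_state=0)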

  2. apply grid search cross-validation on the rest of the data (your train set), something like:

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    dt_clf = DecisionTreeClassifier(random_state=0, class_weight="balanced")
    hyperparams_search_space = {'criterion': ["gini", "entropy"], 'max_depth': [4, 5, 6, 8], 'min_samples_leaf': [2, 3, 5]}
    dec_tree_cross_val_clf = GridSearchCV(dt_clf, hyperparams_search_space, cv=10, scoring=['accuracy', 'recall', 'precision', 'roc_auc'], refit='recall', return_train_score=True, n_jobs=-1)
    dec_tree_cross_val_clf.fit(X_train, y_train)

    where for each scoring metric, you would have train and test values across your k folds in a dataframe via dec_tree_cross_val_clf.cv_results_.
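
    For instance, to inspect those per-fold metrics (a sketch; the pandas usage is an assumption here, not part of the original answer):

    import pandas as pd

    # cv_results_ is a dict of arrays with one row per hyperparameter combination,
    # holding per-fold and mean/std train and test scores for every metric in `scoring`
    cv_results_df = pd.DataFrame(dec_tree_cross_val_clf.cv_results_)
    print(cv_results_df[['params', 'mean_test_recall', 'mean_train_recall']])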

  3. use dec_tree_cross_val_clf.best_estimator_ to make predictions on your validation set and calculate a final score metric:

    dec_tree_cross_val_clf_best_est = dec_tree_cross_val_clf.best_estimator_
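
    and then, for example (assuming X_val and y_val are the rows held out in step 1):

    from sklearn.metrics import recall_score

    # best_estimator_ was already refit on the whole train set (refit='recall'),
    # so it can score the held-out data directly
    y_val_pred = dec_tree_cross_val_clf_best_est.predict(X_val)
    print(recall_score(y_val, y_val_pred))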

Correct answer by German C M on January 1, 2021
