Data Science Asked by Rayyan Abid Ali on January 1, 2021
I am having a hard time reconciling the results of cross-validation with the score I get on a held-out test set.
First I made the following pipeline:
pipe=Pipeline([('clf',DecisionTreeClassifier(random_state=0))])
Then I run cross-validation on the scaled training set (75% of the original dataset):
>>> cross_val_score(pipe, X_train_scaled,Y_train,cv=7).mean()
0.7257796129913106
I then fit the pipeline with the training data and run the classifier on the training data.
>>> pipe.fit(X_train_scaled,Y_train)
>>> pipe.score(X_train_scaled,Y_train)
0.7734339749126584
Finally, I checked the model's performance on the test set:
>>> pipe.score(X_test_scaled, Y_test)
0.941353836876225
Question 1: Have I followed the right steps? Do I even need to fit the pipeline on the training data to get the training score?
Question 2: Why is the test score so much higher than the cross-validated one? Is the model underfitting, or is it okay for this to happen?
At first sight the steps seem to be correct. Nevertheless, I would proceed as follows:
split your dataset to leave some rows out for a final validation
apply grid search cross-validation on the rest of the data (your train set), something like:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=0, class_weight="balanced")
hiperparams_search_space = {
    'criterion': ["gini", "entropy"],
    'max_depth': [4, 5, 6, 8],
    'min_samples_leaf': [2, 3, 5],
}
dec_tree_cross_val_clf = GridSearchCV(
    dt_clf, hiperparams_search_space, cv=10,
    scoring=['accuracy', 'recall', 'precision', 'roc_auc'],
    refit='recall', return_train_score=True, n_jobs=-1)
dec_tree_cross_val_clf.fit(X_train, y_train)
Because return_train_score=True, dec_tree_cross_val_clf.cv_results_ holds both the train and the test values of each scoring metric across your k folds, and it can be loaded into a dataframe for inspection.
use best_estimator_ (the model refit on the whole train set with the best hyperparameters found, here selected by recall) to make predictions on your validation set and calculate a final score metric:
dec_tree_cross_val_clf_best_est = dec_tree_cross_val_clf.best_estimator_
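The three steps above can be put together in a minimal, self-contained sketch. It is only an illustration under stated assumptions: the synthetic dataset from make_classification, the names X_val, y_val, search, results and best, and the reduced grid with cv=5 are all stand-ins chosen to keep the run short, not part of the original question or answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data stands in for the real dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Step 1: leave some rows out for the final validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 2: grid search with cross-validation on the training rows only.
# The grid and cv are smaller than in the answer to keep the example fast.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0, class_weight="balanced"),
    {"criterion": ["gini", "entropy"], "max_depth": [4, 6], "min_samples_leaf": [2, 5]},
    cv=5,
    scoring=["accuracy", "recall"],
    refit="recall",
    return_train_score=True,
)
search.fit(X_train, y_train)

# cv_results_ as a dataframe: one row per hyperparameter combination,
# with mean train and test columns for each scoring metric.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_train_recall", "mean_test_recall",
           "mean_train_accuracy", "mean_test_accuracy"]
print(results[columns].sort_values("mean_test_recall", ascending=False).head())

# Step 3: score the refitted best estimator on the held-out rows.
best = search.best_estimator_
val_recall = recall_score(y_val, best.predict(X_val))
print(search.best_params_, val_recall)
```

Comparing the mean_train_* and mean_test_* columns shows how far training performance is from cross-validated performance for each hyperparameter combination, while the held-out score in the last step is computed only once, after the hyperparameters have been fixed.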
Correct answer by German C M on January 1, 2021