Data Science Asked by JSVJ on August 13, 2021
I have a dataset with 10 features and a binary classification target, which I tested with a decision tree classifier. I did some basic checks, such as looking for missing values, and the data looks clean. The classification accuracy for both the training and testing data is so high that it looks suspicious. Am I making a mistake somewhere, or is there some way to explain why the accuracy is this high?
Can anyone advise me here?
```python
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold, cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSvUmbtHUh2e0iYj7nMDaP8Tf_pCnCa-HrWwAmaxrERxxvd2y_5qxuSP10t6db4RUSjTdOi9WshZhoR/pub?output=csv')

input_selected_features = df.drop(labels='Target', axis=1)
target_selected_feature = df['Target']

X_train, X_test, y_train, y_test = train_test_split(input_selected_features, target_selected_feature, test_size=0.2, train_size=0.8, random_state=101)

# k-fold cross-validation scores
kf = KFold(n_splits=10, shuffle=True)
cv_results = cross_validate(estimator=DecisionTreeClassifier(), X=input_selected_features, y=target_selected_feature, cv=kf, scoring=['accuracy', 'f1'], return_train_score=True)
# print(cv_results)
print('training accuracy - ', cv_results['train_accuracy'].mean())
print('testing accuracy - ', cv_results['test_accuracy'].mean())
print('training f1 score - ', cv_results['train_f1'].mean())
print('testing f1 score - ', cv_results['test_f1'].mean())
```
Your data contains 7621 records, but only 3873 of them are unique.
When duplicates of the same record land in different folds, cross-validation fails: the model has effectively already seen the test records during training, which inflates both the train and test performance metrics. Removing the duplicates decreases the test set's accuracy and F1 scores:
```python
import pandas as pd
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSvUmbtHUh2e0iYj7nMDaP8Tf_pCnCa-HrWwAmaxrERxxvd2y_5qxuSP10t6db4RUSjTdOi9WshZhoR/pub?output=csv', index_col=0)
df.drop_duplicates(inplace=True)

input_selected_features = df.drop(labels=['Target'], axis=1)
target_selected_feature = df['Target']

# k-fold cross-validation scores
kf = KFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_validate(estimator=DecisionTreeClassifier(), X=input_selected_features, y=target_selected_feature, cv=kf, scoring=['accuracy', 'f1'], return_train_score=True)
# print(cv_results)
print('training accuracy - ', cv_results['train_accuracy'].mean())
print('testing accuracy - ', cv_results['test_accuracy'].mean())
print('training f1 score - ', cv_results['train_f1'].mean())
print('testing f1 score - ', cv_results['test_f1'].mean())
```
> training accuracy - 0.9990828974566813
> testing accuracy - 0.7454174325368284
> training f1 score - 0.9990071171229987
> testing f1 score - 0.727174422877005
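The effect is easy to reproduce on synthetic data. The sketch below (using randomly generated features and labels, not the question's dataset) duplicates every row once before cross-validating: a fully grown tree memorizes the training copies, so the test-fold copies are scored almost perfectly, even though nothing real has been learned:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random features with random labels: honest test accuracy should be ~0.5.
X_unique = pd.DataFrame(rng.normal(size=(500, 10)))
y_unique = pd.Series(rng.integers(0, 2, size=500))

# Duplicate every row once, so copies of the same record can land in
# both the training folds and the test fold.
X_dup = pd.concat([X_unique, X_unique], ignore_index=True)
y_dup = pd.concat([y_unique, y_unique], ignore_index=True)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
clean = cross_validate(DecisionTreeClassifier(random_state=0),
                       X_unique, y_unique, cv=kf, scoring='accuracy')
leaky = cross_validate(DecisionTreeClassifier(random_state=0),
                       X_dup, y_dup, cv=kf, scoring='accuracy')

print('test accuracy without duplicates:', clean['test_score'].mean())
print('test accuracy with duplicates:  ', leaky['test_score'].mean())
```

With duplicates the test accuracy jumps far above chance, while on the de-duplicated data it stays around 0.5, which is exactly the pattern seen with your dataset.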
Note also that the first column in the original CSV file is an index column (1, 2, ...). It is better to read that column with the `index_col` argument so that it is not treated as a feature.
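A small sketch of the difference, using a toy inline CSV rather than the question's file:

```python
import pandas as pd
from io import StringIO

# Toy CSV whose first (unnamed) column is a row index, like the question's file.
csv_text = ",f1,Target\n1,0.5,0\n2,0.7,1\n"

# Without index_col, the index column becomes a feature named 'Unnamed: 0'.
df_wrong = pd.read_csv(StringIO(csv_text))
# With index_col=0, it is used as the DataFrame index instead.
df_right = pd.read_csv(StringIO(csv_text), index_col=0)

print(df_wrong.columns.tolist())  # ['Unnamed: 0', 'f1', 'Target']
print(df_right.columns.tolist())  # ['f1', 'Target']
```

An index column left in as a feature is another common source of leakage when the rows were sorted by target before being numbered.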
Answered by Orkun Berk Yuzbasioglu on August 13, 2021