Data Science Asked by JSVJ on August 13, 2021
I have a dataset with 10 features and a binary classification target, which I tested with a decision tree classifier. I did some basic checks, such as looking for missing values, and the data looks clean. The classification accuracy for both the training and testing data is so high that it looks suspicious. Am I making a mistake somewhere, or is there some way to explain why the accuracy is this high?
Can anyone advise me here?
```python
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold, cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSvUmbtHUh2e0iYj7nMDaP8Tf_pCnCa-HrWwAmaxrERxxvd2y_5qxuSP10t6db4RUSjTdOi9WshZhoR/pub?output=csv')

input_selected_features = df.drop(labels='Target', axis=1)
target_selected_feature = df['Target']

X_train, X_test, y_train, y_test = train_test_split(input_selected_features, target_selected_feature, test_size=0.2, train_size=0.8, random_state=101)

# k-fold cross-validation scores
kf = KFold(n_splits=10, shuffle=True)
cv_results = cross_validate(estimator=DecisionTreeClassifier(), X=input_selected_features, y=target_selected_feature, cv=kf, scoring=['accuracy', 'f1'], return_train_score=True)
# print(cv_results)
print('training accuracy - ', cv_results['train_accuracy'].mean())
print('testing accuracy - ', cv_results['test_accuracy'].mean())
print('training f1 score - ', cv_results['train_f1'].mean())
print('testing f1 score - ', cv_results['test_f1'].mean())
```
Your data contains 7621 records, but only 3873 of them are unique.
When duplicates of the same record land in different folds, cross-validation fails: the model has effectively already seen the test records during training, which inflates both the train and test performance metrics. Removing the duplicates decreases the test set's accuracy and F1 scores:
```python
import pandas as pd
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSvUmbtHUh2e0iYj7nMDaP8Tf_pCnCa-HrWwAmaxrERxxvd2y_5qxuSP10t6db4RUSjTdOi9WshZhoR/pub?output=csv', index_col=0)
df.drop_duplicates(inplace=True)

input_selected_features = df.drop(labels=['Target'], axis=1)
target_selected_feature = df['Target']

# k-fold cross-validation scores
kf = KFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_validate(estimator=DecisionTreeClassifier(), X=input_selected_features, y=target_selected_feature, cv=kf, scoring=['accuracy', 'f1'], return_train_score=True)
# print(cv_results)
print('training accuracy - ', cv_results['train_accuracy'].mean())
print('testing accuracy - ', cv_results['test_accuracy'].mean())
print('training f1 score - ', cv_results['train_f1'].mean())
print('testing f1 score - ', cv_results['test_f1'].mean())
```
> training accuracy - 0.9990828974566813
> testing accuracy - 0.7454174325368284
> training f1 score - 0.9990071171229987
> testing f1 score - 0.727174422877005
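The effect is easy to reproduce on synthetic data. The sketch below (using randomly generated features and labels, not the question's dataset) duplicates every row once before cross-validating: a fully grown tree memorizes the training copies, so the test-fold copies are scored almost perfectly, even though nothing real has been learned:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random features with random labels: honest test accuracy should be ~0.5.
X_unique = pd.DataFrame(rng.normal(size=(500, 10)))
y_unique = pd.Series(rng.integers(0, 2, size=500))

# Duplicate every row once, so copies of the same record can land in
# both the training folds and the test fold.
X_dup = pd.concat([X_unique, X_unique], ignore_index=True)
y_dup = pd.concat([y_unique, y_unique], ignore_index=True)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
clean = cross_validate(DecisionTreeClassifier(random_state=0),
                       X_unique, y_unique, cv=kf, scoring='accuracy')
leaky = cross_validate(DecisionTreeClassifier(random_state=0),
                       X_dup, y_dup, cv=kf, scoring='accuracy')

print('test accuracy without duplicates:', clean['test_score'].mean())
print('test accuracy with duplicates:  ', leaky['test_score'].mean())
```

With duplicates the test accuracy jumps far above chance, while on the de-duplicated data it stays around 0.5, which is exactly the pattern seen with your dataset.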
Note also that the first column in the original CSV file is an index column (1, 2, ...). It is better to read that column with the `index_col` argument so that it is not treated as a feature.
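A small sketch of the difference, using a toy inline CSV rather than the question's file:

```python
import pandas as pd
from io import StringIO

# Toy CSV whose first (unnamed) column is a row index, like the question's file.
csv_text = ",f1,Target\n1,0.5,0\n2,0.7,1\n"

# Without index_col, the index column becomes a feature named 'Unnamed: 0'.
df_wrong = pd.read_csv(StringIO(csv_text))
# With index_col=0, it is used as the DataFrame index instead.
df_right = pd.read_csv(StringIO(csv_text), index_col=0)

print(df_wrong.columns.tolist())  # ['Unnamed: 0', 'f1', 'Target']
print(df_right.columns.tolist())  # ['f1', 'Target']
```

An index column left in as a feature is another common source of leakage when the rows were sorted by target before being numbered.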
Answered by Orkun Berk Yuzbasioglu on August 13, 2021