TransWikia.com

Hello, when I train my model on 80% of the data and test on the other 20%, the accuracy is 49%, but without a split it is 99%

Data Science Asked on March 23, 2021

Hello, when I train my model on 80% of the data and test it on the remaining 20%, the accuracy is 49%. When I train on the data without splitting, the accuracy is around 99%. I'm confused. Please help me with this.

The code below uses the split and gets 49% accuracy:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv(r"dataset.csv")

# Encode every text (object) column as integer codes
le = LabelEncoder()
objList = data.select_dtypes(include="object").columns
for feat in objList:
    data[feat] = le.fit_transform(data[feat].astype(str))

# Features are all columns except 'Outcome'; the target is kept 1-D,
# which is the shape scikit-learn expects for y
X = data.drop(columns="Outcome").values
y = data["Outcome"].values

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

randomforest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
randomforest.fit(X_train, y_train)

# Evaluate on the held-out rows only
y_pred = randomforest.predict(X_test)
print(accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)

The code below does not split the data and gets 99% accuracy:

import pickle

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv(r"dataset.csv")

# Encode every text (object) column as integer codes
le = LabelEncoder()
objList = data.select_dtypes(include="object").columns
for feat in objList:
    data[feat] = le.fit_transform(data[feat].astype(str))

X = data.drop(columns="Outcome").values
y = data["Outcome"].values

# Load the saved model under its own name so that `data` is not overwritten,
# and let the `with` block close the file
with open(r"finalized_model.pkl", "rb") as file:
    model = pickle.load(file)

# Predict on the full dataset, with no rows held out
y_pred = model.predict(X)
print(accuracy_score(y, y_pred))
cm = confusion_matrix(y, y_pred)
print(cm)

The same full dataset is used in both scripts.

2 Answers

When you split the data, you train your model on 80 percent of the dataset and measure accuracy on the remaining 20 percent, which the model has never seen.
When you do not split the data, the model is scored on the same data it learned from, so the accuracy is very high.

Two points to keep in mind:
1. Always split the data and aim for high accuracy on the test set rather than on the training set.
2. If your training accuracy is much higher than your test accuracy, your model is overfitting.
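
A minimal sketch of that check, comparing training and test accuracy for the same fitted model. The real dataset.csv is not available here, so the data is a synthetic stand-in from make_classification; that substitution, the sample sizes, and the name model are all assumptions for illustration. The forest settings are copied from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset.csv (assumption: the real data is not available)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Same 80/20 split as in the question, stratified to keep class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

model = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
model.fit(X_train, y_train)

# Accuracy on rows the model was fitted on vs. rows it has never seen
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {train_acc - test_acc:.3f}")
```

A large gap between the two numbers points to overfitting. The test accuracy is the one that estimates how the model will do on new data.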

Answered by Shiv on March 23, 2021

Your model is overfitting to your training data. Try tuning the hyperparameters so that it reaches good accuracy when you split the data. Otherwise the model is not very useful.
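
One possible way to do that tuning is a cross-validated grid search on the training portion only, as sketched below. The data is again a synthetic stand-in for dataset.csv, and the parameter grid (n_estimators, max_depth, min_samples_leaf and the values tried) is an illustrative assumption, not a recommendation for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for dataset.csv (assumption: the real data is not available)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Illustrative grid: shallower trees and larger leaves limit memorization
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validation on the training data only; the test set stays untouched
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")

# Final check on the held-out 20%, using the refitted best model
test_acc = search.score(X_test, y_test)
print(f"held-out test accuracy:   {test_acc:.3f}")
```

The test set is used only once, at the end, so the reported score is not inflated by the tuning itself.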

Answered by gihan on March 23, 2021
