Accuracy is 49% when I train on 80% of the data and test on 20%, but 99% without a split

Question

Hello, when I train my model on 80% of the data and test it on the remaining 20%, the accuracy is 49%. When I train and evaluate on the data without splitting, the accuracy is around 99%. I'm confused. Please help me with this.
The code below uses the split and gets 49% accuracy:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix

data = pd.read_csv(r"dataset.csv")

# Encode every text (object) column as integer codes
le = LabelEncoder()
objList = data.select_dtypes(include = "object").columns
for feat in objList:
    data[feat] = le.fit_transform(data[feat].astype(str))

# Features are all columns except 'Outcome'; the target is a 1-D array
X = data.drop(columns = 'Outcome').values
y = data['Outcome'].values

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

randomforest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
randomforest.fit(X_train, y_train)

# Evaluate only on rows the model did not see during training
y_pred = randomforest.predict(X_test)

print(metrics.accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)

The code below does not split the data and gets 99% accuracy:
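import pickle
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.metrics import confusion_matrix

data = pd.read_csv(r'dataset.csv')

# Encode every text (object) column as integer codes
le = LabelEncoder()
objList = data.select_dtypes(include = "object").columns
for feat in objList:
    data[feat] = le.fit_transform(data[feat].astype(str))

X = data.drop(columns = 'Outcome').values
y = data['Outcome'].values

# Load the previously saved model; use a separate name so it does not overwrite the DataFrame
with open(r'finalized_model.pkl', 'rb') as file:
    model = pickle.load(file)

# Predictions are made on the full dataset, with no held-out rows
y_pred = model.predict(X)

print(metrics.accuracy_score(y, y_pred))
cm = confusion_matrix(y, y_pred)
print(cm)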

Both scripts use the same dataset.

Shiv · Answer

When you split the data, you train the model on 80 percent of the dataset and measure accuracy on the remaining 20 percent, which the model has never seen.
When you do not split the data, the model is evaluated on the same data it learned from, so the accuracy is very high.
You should:
1. Always split the data and aim for high accuracy on the test set rather than the training set.
2. Treat training accuracy that is much higher than test accuracy as a sign that your model is overfitting.
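
A minimal sketch of the check described in the points above: score the same model on the rows it was trained on and on held-out rows, then compare. The original dataset.csv is not available here, so the data is synthetic (make_classification with noisy labels is an assumption made purely for illustration), and the numbers it prints will not match the 49% and 99% from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset.csv (assumption): 30% of labels are randomized
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.3, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Same model settings as in the question
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
model.fit(X_train, y_train)

# Accuracy on rows the model has already seen
train_acc = accuracy_score(y_train, model.predict(X_train))
# Accuracy on rows the model has never seen
test_acc = accuracy_score(y_test, model.predict(X_test))

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
print("gap:", train_acc - test_acc)
```

A large gap between the two numbers points to overfitting; the test accuracy is the one that estimates how the model will behave on new data.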

gihan · Answer

Your model is overfitting your training data. Try tuning the hyperparameters so that it gets good accuracy when you split the data. Otherwise the model is not that useful.
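
One possible way to do that tuning is a cross-validated grid search on the training split only, keeping the test split for a final check. This is a sketch under assumptions: the data is synthetic (make_classification stands in for dataset.csv), and the grid values below are hypothetical starting points, not recommended settings for the original data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for dataset.csv (assumption)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Hypothetical grid: shallower trees and larger leaves limit overfitting
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validation runs on the training split only
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)

# Final check on the held-out 20%, using the refitted best model
test_acc = search.score(X_test, y_test)
print("test accuracy:", test_acc)
```

If the test accuracy stays near chance level whatever the settings, the features may carry little information about the target, and tuning alone will not fix that.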
