Data Science Asked on March 23, 2021
Hello, when I train my model on 80% of the data and test it on the remaining 20%, the accuracy is 49%. When I evaluate it without splitting the data, I get around 99%. I'm confused. Please help me with this.
The code below uses the split and got 49% accuracy:
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv(r"dataset.csv")
# Label-encode every text (object) column
le = LabelEncoder()
objList = data.select_dtypes(include = "object").columns
for feat in objList:
    data[feat] = le.fit_transform(data[feat].astype(str))
# Features: every column except 'Outcome'; target: 'Outcome' as a 1-D array
X = data.iloc[:, data.columns != 'Outcome'].values
y = data['Outcome'].values
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
randomforest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
randomforest.fit(X_train, y_train)
# Evaluate only on the held-out rows
y_pred = randomforest.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)
The code below does not split the data and got 99% accuracy:
import pickle
import pandas as pd
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv(r'dataset.csv')
# Label-encode every text (object) column
le = LabelEncoder()
objList = data.select_dtypes(include = "object").columns
for feat in objList:
    data[feat] = le.fit_transform(data[feat].astype(str))
X = data.iloc[:, data.columns != 'Outcome'].values
y = data['Outcome'].values
# Load the saved model into its own variable so that `data` is not overwritten
with open(r'finalized_model.pkl', 'rb') as file:
    model = pickle.load(file)
# Predict on the full dataset, with no train/test split
y_pred = model.predict(X)
print(metrics.accuracy_score(y, y_pred))
cm = confusion_matrix(y, y_pred)
print(cm)
The same full dataset is used in both code snippets.
When you split the data, you train your model on 80 percent of the dataset but measure accuracy on the remaining 20 percent, which the model has not seen.
When you do not split the data, the model is evaluated on the same data it learned from, hence the accuracy is very high.
You should:
1. Always split the data and aim for high accuracy on the test set rather than the training set.
2. If your training accuracy is much higher than your test accuracy, your model is overfitting.
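As a rough illustration of that check, the sketch below fits the same kind of random forest and prints training and test accuracy side by side. The data here is a synthetic stand-in generated with make_classification, since the asker's dataset.csv is not available; the sample size and label-noise level are assumptions chosen only to make the gap visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset.csv (assumption: the real data is not available here).
# flip_y adds label noise so the model cannot generalize perfectly.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# Same split and model settings as in the question
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
model.fit(X_train, y_train)

# Accuracy on rows the model was fitted on vs. rows it has never seen
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

A large gap between the two numbers is the overfitting signal described in point 2; the test figure is the one to trust.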
Answered by Shiv on March 23, 2021
Your model is overfitting to your training data. Try tuning the hyperparameters so that it achieves good accuracy when you split the data. Otherwise the model is not that useful.
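One possible way to do that tuning is a cross-validated grid search, sketched below with scikit-learn's GridSearchCV. The synthetic data and the particular grid values (n_estimators, max_depth, min_samples_leaf) are illustrative assumptions, not settings taken from the question; a sensible grid depends on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for dataset.csv (assumption: the real data is not available here)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Hypothetical grid: shallower trees and larger leaves tend to reduce overfitting
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validation on the training portion only; the test set stays untouched
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)

# Final check on the held-out 20%, using the refitted best model
test_acc = search.score(X_test, y_test)
print("test accuracy:", test_acc)
```

Keeping the test set out of the search means the final accuracy is still an estimate on unseen data.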
Answered by gihan on March 23, 2021