Data Science Asked by Harigovind Valsakumar on June 16, 2021
I got 100% accuracy on my test set when training with the decision tree algorithm, but only 85% accuracy with the random forest.
Is there something wrong with my model, or is a decision tree simply best suited to this dataset?
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(x_train, y_train)
predictions = rf.predict(x_test)
cm = confusion_matrix(y_test, predictions)
print(cm)

# Decision Tree
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
cm = confusion_matrix(y_test, predictions)
print(cm)
Random Forest:
[[19937     1]
 [    8    52]]

Decision Tree:
[[19938     0]
 [    0    60]]
There may be a few reasons this is happening.
First of all, check your code. 100% accuracy seems unlikely in almost any setting. How many test data points do you have? How many training data points did you train your model on? You may have made a coding mistake and compared the same list to itself.
Did you use a separate test set for testing? The high accuracy may be due to luck - try k-fold cross-validation, for which libraries are widely available.
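For example, here is a minimal k-fold sketch using scikit-learn's cross_val_score, with x and y taken from the question's code (the choice of 5 folds is arbitrary):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Five folds give five accuracy estimates instead of a single,
# possibly lucky, train/test cut.
clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, x, y, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean(), "Std:", scores.std())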
You can visualise your decision tree to find out what is happening. If it has 100% accuracy on the test set, does it also have 100% on the training set?
Correct answer by c zl on June 16, 2021
Please check whether you used your test set for building the model. This is a common scenario; see, for example:
Random Forest Classifier gives very high accuracy on test set - overfitting?
If that is the case, everything makes sense: the Random Forest tries not to overfit your model, while a decision tree will simply memorize your data as a tree.
Answered by SmallChess on June 16, 2021
Agree with c zl: in my experience this doesn't sound like a stable model; it points to a random, lucky cut of the data, and to something that will struggle to deliver similar performance on unseen data.
Bootstrapping and k-fold cross-validation should usually provide more reliable performance numbers.
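As a rough illustration, here is a minimal bootstrap sketch using scikit-learn's resample (the number of resamples, 20, is arbitrary; x_train, x_test, y_train, y_test come from the question's split):

from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Refit on bootstrap resamples of the training data and score on the
# untouched test set; a wide spread of scores signals an unstable model.
scores = []
for seed in range(20):
    xb, yb = resample(x_train, y_train, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed).fit(xb, yb)
    scores.append(accuracy_score(y_test, clf.predict(x_test)))
print("Bootstrap accuracies:", min(scores), "to", max(scores))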
Answered by Dan9ie on June 16, 2021
The default hyper-parameters of the DecisionTreeClassifier allow it to overfit your training data. The default min_samples_leaf is 1 and the default max_depth is None. This combination allows your DecisionTreeClassifier to grow until there is a single data point at each leaf.
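To rein that in, you can constrain those hyper-parameters explicitly; a minimal sketch (the values 5 and 10 are illustrative, not tuned):

from sklearn.tree import DecisionTreeClassifier

# Each leaf must hold at least 5 samples and the tree may not grow
# deeper than 10 levels, which limits pure memorization of the data.
clf = DecisionTreeClassifier(min_samples_leaf=5, max_depth=10, random_state=42)
clf.fit(x_train, y_train)
print("Test accuracy:", clf.score(x_test, y_test))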
Since you are getting $100\%$ accuracy, I would assume you have duplicates across your train and test splits. This has nothing to do with the way you split, but with the way you cleaned your data.
Can you check if you have duplicate datapoints?
x = [[1, 2, 3],
     [4, 5, 6],
     [1, 2, 3]]
y = [1, 2, 1]

initial_number_of_data_points = len(x)

def get_unique(X_matrix, y_vector):
    # Pair each row with its label, use a set to drop exact duplicates,
    # then unzip back into a feature matrix and a label vector.
    Xy = list(set(zip([tuple(row) for row in X_matrix], y_vector)))
    X_matrix = [list(pair[0]) for pair in Xy]
    y_vector = [pair[1] for pair in Xy]
    return X_matrix, y_vector

x, y = get_unique(x, y)
data_points_removed = initial_number_of_data_points - len(x)
print("Number of duplicates removed:", data_points_removed)
If you have duplicates across your train and test splits, high accuracies are to be expected.
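As an aside, the same deduplication can be done with pandas; a minimal sketch, assuming your data fits in a DataFrame (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame(x, columns=["f1", "f2", "f3"])
df["target"] = y
before = len(df)
df = df.drop_duplicates()  # drops rows identical in features and target
print("Number of duplicates removed:", before - len(df))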
Answered by Bruno Lubascher on June 16, 2021
So, my advice:
Answered by avchauzov on June 16, 2021
I believe the problem you are facing is a class-imbalance problem: 99% of your data belongs to one class, and the test data you have may consist of that class only. Because 99% of the data belongs to one class, there is a high probability that your model will predict that class for all of your test data. To deal with imbalanced data you should use AUROC instead of accuracy, and you can use techniques like oversampling and undersampling to balance the data set.
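For example, a minimal sketch of scoring with AUROC instead of accuracy, reusing the rf model and the split from the question's code:

from sklearn.metrics import roc_auc_score

# AUROC is computed from predicted probabilities, so a model that
# blindly predicts the majority class no longer looks near-perfect.
proba = rf.predict_proba(x_test)[:, 1]  # probability of the positive class
print("AUROC:", roc_auc_score(y_test, proba))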
Answered by Vikram on June 16, 2021
I had a similar issue, but I realized that I had included my target variable while predicting the test outcomes.
With the error:
predict(object = model_nb, test[,])
Without the error:
predict(object = model_nb, test[,-16])
where the 16th column held the dependent variable.
Answered by Jeremiah Osibe on June 16, 2021