Same confusion matrix when changing DecisionTreeClassifier parameters

Data Science · Asked on April 26, 2021

I'm trying to build my first decision tree classifier using the Iris dataset from the sklearn library. Here is my first version of the code:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import cross_val_score
from sklearn import tree
import numpy as np
import graphviz

iris = load_iris()

clf_ex1 = tree.DecisionTreeClassifier(criterion="entropy",random_state=300,min_samples_leaf=5,
                                      class_weight={0:1,1:10,2:10})

np.random.seed(0)

# Shuffle all 150 sample indices; hold out the last 10 as a test set
indices = np.random.permutation(len(iris.data))
indices_training = indices[:-10]
indices_test = indices[-10:]

iris_X_train = iris.data[indices_training]
iris_y_train = iris.target[indices_training]
iris_X_test  = iris.data[indices_test]
iris_y_test  = iris.target[indices_test]

clf_ex1 = clf_ex1.fit(iris_X_train, iris_y_train)

predicted_y_test = clf_ex1.predict(iris_X_test)

print(confusion_matrix(iris_y_test, predicted_y_test))


print("Predictions:")
print(predicted_y_test)
print("True classes:")
print(iris_y_test) 
print("--------")
print(iris.target_names)

# Print some metric results
acc_score = accuracy_score(iris_y_test, predicted_y_test)
print("--------")
print("Accuracy score: "+ str(acc_score))
print("--------")
f1=f1_score(iris_y_test, predicted_y_test, average='macro')
print("F1 score: "+str(f1))
print("--------")

# 5-fold cross-validation on the full dataset
scores = cross_val_score(clf_ex1, iris.data, iris.target, cv=5)
print(scores)

dot_data = tree.export_graphviz(clf_ex1, out_file=None, 
                         feature_names=iris.feature_names, 
                         class_names=iris.target_names, 
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph

As you can see, in my DecisionTreeClassifier I set the class weights, giving larger values to the second and third classes, and I set the random_state parameter to 300. Then I made another example by changing these parameters this way:

clf_ex2 = tree.DecisionTreeClassifier(criterion="entropy",random_state=300,min_samples_leaf=5,
                                  class_weight={0:1,1:1,2:10})

and this way:

clf_ex3 = tree.DecisionTreeClassifier(class_weight=None, criterion='entropy', 
                                  max_depth=2, 
                                  max_leaf_nodes=None, 
                                  min_samples_leaf=15, 
                                  min_samples_split=5, 
                                  random_state=100, 
                                  splitter='best')

The problem is that all the values I print (the confusion matrix, the accuracy, predicted_y_test and the F1 score) stay the same across the three versions. The only value that changes is the cross-validation score. Why?

One Answer

I ran your script and this is what was returned for all 3 confusion matrices:

[[4 0 0]
 [0 3 1]
 [0 0 2]]

This confusion matrix indicates to me that your model is working wonderfully on the testing data. There is a single exception where your model predicts the wrong class (the non-diagonal "1").
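If you want to see exactly which test sample that is, here is a quick sketch (reusing the predicted_y_test, iris_y_test and iris variables from your script):

import numpy as np

# Positions in the test set where prediction and truth disagree
wrong = np.where(predicted_y_test != iris_y_test)[0]
for i in wrong:
    print("Sample", i,
          "predicted:", iris.target_names[predicted_y_test[i]],
          "actual:", iris.target_names[iris_y_test[i]])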

Since you are giving the same random seed every time, your training and testing indices are the same for all 3 models you are investigating. Meaning: they are all being trained on the same data and tested on the same data.
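You can convince yourself of this with a tiny sketch: with the seed fixed, the permutation is bit-for-bit identical on every run, so every model gets the same 140/10 split.

import numpy as np

np.random.seed(0)
first = np.random.permutation(150)

np.random.seed(0)
second = np.random.permutation(150)

# Identical seeds produce identical permutations, hence identical splits
print(np.array_equal(first, second))  # True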

It is quite possible that, in this test set, 9 out of 10 samples are very easy to classify, while the remaining one is much harder. The contrast might be so stark that adjusting the hyperparameters of your DecisionTreeClassifier doesn't change the predictions on these particular 10 samples at all.
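You can verify this directly by fitting all three configurations on the same split and comparing their test-set predictions (a sketch reusing clf_ex1, clf_ex2, clf_ex3 and the iris_* arrays from your script):

for name, clf in [("ex1", clf_ex1), ("ex2", clf_ex2), ("ex3", clf_ex3)]:
    clf.fit(iris_X_train, iris_y_train)
    # If all three rows are identical, the hyperparameter changes have
    # no effect on these particular 10 test samples
    print(name, clf.predict(iris_X_test))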

So this could explain why your output is the same for all those fields you mentioned.

As for the cross_val_score - this method splits your entire dataset into 5 folds and runs 5 successive train/test trials. (For a classifier, cv=5 defaults to a stratified split in dataset order, so it doesn't depend on your np.random.seed(0).) Each model is now trained and evaluated on five different, much larger splits covering all 150 samples, so the differences between your three configurations finally show up and you get similar (but not identical) scores.
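If you want the folds to be explicit (or shuffled reproducibly), you can pass a splitter object yourself; a minimal sketch:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# For classifiers, cv=5 is shorthand for StratifiedKFold(n_splits=5),
# which splits in dataset order without shuffling. Passing the splitter
# explicitly makes that visible; shuffle + random_state gives you a
# reproducible shuffled variant instead.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf_ex1, iris.data, iris.target, cv=skf)
print(scores)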

As a general note - be careful how you're currently using cross_val_score. You are feeding a pre-trained model to a method that expects an un-trained one. This is harmless here, because cross_val_score clones the estimator and re-fits the clones on each fold, but it means the fitted state of clf_ex1 is ignored: the scores you get back come from freshly trained copies, not from the model you fit earlier, which might not be what you intend.
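Since cross_val_score calls sklearn.base.clone on whatever estimator you pass it and fits a fresh clone per fold, a clearer pattern is to hand it an unfitted estimator yourself, for example:

from sklearn.base import clone

# clone() copies the hyperparameters but discards any fitted state,
# so cross_val_score starts from a genuinely untrained tree on each fold
fresh_clf = clone(clf_ex1)
scores = cross_val_score(fresh_clf, iris.data, iris.target, cv=5)
print(scores)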

Correct answer by Oliver Foster on April 26, 2021
