
How to get probability of classification

Data Science Asked on July 8, 2021

I have a binary classification problem and I tried several models: KNN, SVM, decision tree, and random forest. I have 50,000 samples; X_train has 50,000 rows and 2,300 columns. Everything works well, but I want to build a semi-supervised model because I also have some unlabeled samples. For that I need the probability of each classification from the models I tried, but it doesn't work.

At first, I tried it for KNN

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 1, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classifier.predict_proba(X_test[:1]))  # probabilities for the first test sample

I get [[1. 0.]]. I don't understand why it is 1. (At first I thought it meant 100%, but I get the same result for all test samples.)

Then I tried it for the decision tree

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classifier.predict_proba(X_test[:1]))

I get [[1. 0.]] too. Why is it an integer?

One Answer

It is indeed a probability of 1, and it follows directly from the model settings you are using (n_neighbors=1 for KNN and the default, unlimited depth for the decision tree).

The probability reported by KNN is the fraction of the k nearest neighbors belonging to each class. With only one neighbor (n_neighbors=1), it can only be 1 or 0.
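For example, with a larger k the predicted probability becomes the fraction of neighbors in each class. Here is a minimal sketch, assuming the same X_train, y_train, and X_test as in the question:

from sklearn.neighbors import KNeighborsClassifier

# With k=5, each class probability is a multiple of 1/5 (0.0, 0.2, ..., 1.0)
# rather than only 0 or 1.
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
print(classifier.predict_proba(X_test[:1]))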

The DecisionTreeClassifier expands until all the training data is classified perfectly if you don't limit its depth. Again, this likely leads to overfitting and, as a result, to extreme probability predictions. You should try different values for max_depth and see what works best, for example by performing cross-validation. (If you are unfamiliar with these concepts, I recommend reading up on them first.)
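As a rough sketch of how that search could look with scikit-learn's GridSearchCV (the candidate depths and the scoring choice below are illustrative assumptions, not recommendations for this particular dataset):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Try a handful of depths; None means the tree grows until the leaves are pure.
param_grid = {'max_depth': [3, 5, 10, 20, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='neg_log_loss',  # log loss scores the predicted probabilities directly
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_estimator_.predict_proba(X_test[:1]))

A depth-limited tree returns the class proportions of the leaf a sample falls into, so the predicted probabilities are no longer forced to 0 or 1.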

Correct answer by oW_ on July 8, 2021
