Data Science Asked on July 8, 2021
I have a binary classification problem and have tried several models: KNN, SVM, decision tree, and random forest. I have 50,000 samples; X_train has 50,000 rows and 2,300 columns. Everything works well, but now I want to build a semi-supervised model because I also have unlabeled samples. For that I need the class probabilities from the classifiers I tried, but it doesn't work.
At first, I tried it for KNN:

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=1, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classifier.predict_proba(X_test[0:1]))  # slice keeps the sample 2-D, as predict_proba expects
I get [[1. 0.]], and I don't understand why it is exactly 1. (At first I thought it meant 100%, but I get the same output for every test sample.)
Then I tried it for the decision tree:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classifier.predict_proba(X_test[0:1]))  # slice keeps the sample 2-D
I get [[1. 0.]] too. Why is it an integer?
It is indeed a probability of 1, because you didn't change the default parameters.
The probability reported by KNN is the fraction of the k nearest neighbors belonging to each class. With only one neighbor (n_neighbors=1), that fraction can only be 1 or 0.
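A minimal sketch of this effect, using a synthetic dataset from make_classification in place of the original data (the dataset, sizes, and k=15 are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy binary problem standing in for the real 50,000 x 2,300 data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With k=1 the single nearest neighbor decides alone:
# every predicted probability is exactly 0.0 or 1.0
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(knn1.predict_proba(X_test[:3]))

# With k=15 the probability is the fraction of the 15 neighbors in each
# class, so it can take intermediate values (multiples of 1/15)
knn15 = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)
print(knn15.predict_proba(X_test[:3]))
```

The first printout contains only 0.0 and 1.0; the second typically shows graded probabilities, which is what a semi-supervised scheme can actually use.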
The DecisionTreeClassifier expands until all the training data is classified perfectly if you don't control the depth, so every leaf is pure. This likely led to overfitting, and to extreme probability predictions as a result. You should try different values for max_depth and see what works best; you can do so by performing cross-validation. (If you are unfamiliar with these concepts, I recommend reading up on them first.)
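One way to sketch that tuning step is a GridSearchCV over a few candidate depths, again on synthetic stand-in data (the dataset and the depth grid are assumptions, not values from the question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for X_train / y_train
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 5-fold cross-validation over a handful of candidate depths;
# None means "grow until all leaves are pure" (the overfitting default)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, None]},
    cv=5,
)
search.fit(X, y)

print(search.best_params_)  # depth chosen by cross-validated accuracy
# With a limited depth, leaves mix classes, so probabilities are
# class frequencies in the leaf rather than being forced to 0 or 1
print(search.best_estimator_.predict_proba(X[:1]))
```

If the search still picks max_depth=None, that is cross-validation telling you the deep tree generalizes best on this data; on the real 2,300-feature problem a shallower tree is more likely to win.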
Correct answer by oW_ on July 8, 2021