Data Science Asked by user872009 on July 15, 2021
I ran the code below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from numpy import sqrt
from numpy import argmax
from sklearn.metrics import roc_curve
from sklearn.preprocessing import StandardScaler
def standardize(variable):
    return (variable - np.mean(variable)) / np.std(variable)
def normalize(x):
    return (x-x.min()/(x.max()- x.min()))
data.columns = np.arange(len(data.columns))
trainX, testX, trainy, testy=train_test_split(X,y,test_size=0.5,random_state=2, stratify=y)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
#yhat = model.predict_proba(testX)
yhat = normalize(testX.values)
yhat = yhat[:, 0]
print(yhat)
# calculate roc curves
fpr, tpr, thresholds = roc_curve(testy, yhat)
#print(thresholds)
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-mean=%.3f' % (thresholds[ix], gmeans[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.', label='Logistic')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
The output is: Best Threshold=-0.049752, G-mean=0.889
Why is the optimal threshold a negative number, and what does that mean?
I'm not familiar with the way you are obtaining the optimal threshold, but there may be a simpler way.
What you are looking for is the point on the ROC curve that sits as far left as possible on the x-axis (low false positive rate) and as high as possible on the y-axis (high true positive rate). Maximizing the difference tpr - fpr, known as Youden's J statistic, gives you that point.
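import numpy as np
from sklearn.metrics import roc_curve
# best_model, X_train and y_train stand for your fitted classifier and your data
# column 1 of predict_proba holds P(Y = 1 | X)
yhat = best_model.predict_proba(X_train)[:,1]
fpr, tpr, thresholds = roc_curve(y_train, yhat)
# index where the gap between TPR and FPR is largest
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]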
This threshold gives the best trade-off between a low false positive rate and a high true positive rate, in the sense that it maximizes tpr - fpr.
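To make this concrete, here is a minimal end-to-end sketch. It assumes synthetic data from make_classification in place of the asker's X and y, which are not shown, and it computes both the tpr - fpr criterion and the G-mean criterion from the question. It is an illustration under those assumptions, not a reproduction of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data; a stand-in for the question's X and y (assumption).
X, y = make_classification(n_samples=1000, n_features=5, random_state=2)
trainX, testX, trainy, testy = train_test_split(
    X, y, test_size=0.5, random_state=2, stratify=y
)

model = LogisticRegression(solver="lbfgs")
model.fit(trainX, trainy)

# Column 1 holds P(Y = 1 | X), which matches roc_curve's positive class here.
yhat = model.predict_proba(testX)[:, 1]
fpr, tpr, thresholds = roc_curve(testy, yhat)

# Criterion from the answer: largest gap between TPR and FPR.
j_threshold = thresholds[np.argmax(tpr - fpr)]
# Criterion from the question: largest geometric mean of TPR and TNR.
g_threshold = thresholds[np.argmax(np.sqrt(tpr * (1 - fpr)))]

print("tpr - fpr threshold:", j_threshold)
print("G-mean threshold:", g_threshold)
```

Because the scores passed to roc_curve are probabilities, both selected thresholds fall between 0 and 1. The two criteria often pick nearby points, but they are not guaranteed to agree.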
EDIT
I just noticed that you are passing $P(Y = 0 \mid X)$, i.e. yhat = yhat[:, 0].
Try passing yhat = yhat[:, 1] instead.
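One more possible source of the negative value, offered as an observation on the posted code rather than a confirmed diagnosis: the predict_proba line in the question is commented out, so yhat is a rescaled feature column, not a probability. In addition, normalize as written divides only x.min() by the range because of operator precedence. roc_curve takes its thresholds from the scores it receives, so negative scores can produce a negative threshold. The sketch below uses made-up feature values and a hypothetical min_max_normalize helper to show the difference.

```python
import numpy as np

def normalize_as_posted(x):
    # Same expression as in the question: only x.min() is divided by the range.
    return (x - x.min() / (x.max() - x.min()))

def min_max_normalize(x):
    # Hypothetical corrected helper: standard min-max scaling into [0, 1].
    return (x - x.min()) / (x.max() - x.min())

# Made-up feature values for illustration only.
x = np.array([-2.0, 0.0, 1.0, 3.0])

as_posted = normalize_as_posted(x)
rescaled = min_max_normalize(x)

print("as posted:", as_posted)   # contains a negative value
print("min-max:  ", rescaled)    # stays within [0, 1]
```

Even with the corrected scaling, a rescaled feature is not a model score, so the threshold would still describe that feature rather than the classifier's predicted probability.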
Answered by Julio Jesus on July 15, 2021