Data Science Asked on August 31, 2021
Hi all,

I am training a LightGBM model and have used all of the usual parameters to guard against overfitting. I plot the distribution of the predicted probabilities (i.e. the probability of having cancer) from the model, after calibrating with CalibratedClassifierCV (a sketch of that step follows the plotting code below), as a histogram or KDE. As you can see, the probabilities for class 1 are concentrated at the upper and lower ends.

I have tried playing around with the bandwidth to smooth this a little, but it doesn't smooth the bumps much. What do you think this shows about my model? Isn't it a good thing that the model assigns a greater probability to class 1 (which is "has cancer")?

I am unsure how to interpret this, or where I could be going wrong.

The red curve is the positive class (has cancer) and the blue curve is the negative class. Below is the code used to generate the plot.
import matplotlib.pyplot as plt

results = df[['label', 'predicted_prob']]
colors = ['b', 'r']  # index 0 = blue (negative class), index 1 = red (positive class)
for label in [0, 1]:
    # KDE of the calibrated predicted probabilities for each class
    results[results['label'] == label]['predicted_prob'].plot.kde(bw_method=0.35, color=colors[label])
plt.xlim(0, 1)
plt.show()
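For context, the calibration step mentioned above was done with scikit-learn's CalibratedClassifierCV; a minimal sketch of what that pipeline looks like (X_train, y_train, X_test are hypothetical placeholders, and my actual LightGBM parameters are omitted):

from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV

# CalibratedClassifierCV fits the LightGBM model on cv-fold training splits
# and learns a calibration mapping on the held-out folds
# (sigmoid = Platt scaling, isotonic = non-parametric).
base = LGBMClassifier()  # anti-overfitting parameters omitted here
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)  # X_train, y_train are placeholders

# Probability of the positive class ("has cancer") on held-out data
predicted_prob = calibrated.predict_proba(X_test)[:, 1]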
Such a plot doesn't really tell you much about overfitting.
First, check that your calibration has worked well; it's possible that an incorrect calibration has pushed the probabilities to the extremes. Otherwise, the distribution of probabilities being so extreme suggests the data just naturally separates into a segment of easy-to-detect cancers and the rest. Among the latter, it looks like you get reasonably good but not great rank-ordering of cases.
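One quick way to sanity-check the calibration is a reliability diagram, and ROC AUC gives a calibration-independent summary of the rank-ordering. A minimal sketch with scikit-learn, assuming the same df with label and predicted_prob columns as in the question:

from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

# Reliability diagram: bin the predictions and compare the mean predicted
# probability in each bin to the observed fraction of positives.
prob_true, prob_pred = calibration_curve(df['label'], df['predicted_prob'], n_bins=10)
plt.plot(prob_pred, prob_true, marker='o', label='model')
plt.plot([0, 1], [0, 1], 'k--', label='perfectly calibrated')
plt.xlabel('mean predicted probability')
plt.ylabel('observed fraction of positives')
plt.legend()
plt.show()

# AUC measures rank-ordering only: it is invariant to any monotonic
# transformation of the probabilities, so it is unaffected by calibration.
print(roc_auc_score(df['label'], df['predicted_prob']))

If the reliability curve hugs the diagonal, the extreme probabilities are genuinely supported by the data rather than being an artifact of a miscalibrated model.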
Correct answer by Ben Reiniger on August 31, 2021