
Learning Curves and Interpretations

Data Science Asked on July 26, 2021

I’ve trained 4 classifiers on an undersampled dataset.
I plotted the learning curve for each classifier and got the following results:

[Figure: learning curves (training score in red, cross-validation score in green, against training set size) for the four classifiers: Logistic Regression, SVC, KNN, and Random Forest]

For the Logistic Regression, both curves seem to converge, so adding more data will stop helping at some point.
For the SVC I have no idea (other than that adding more data seems to help!).
For KNN: adding more data increases the accuracy of both curves.
For the Random Forest: I have no idea.

I would love to understand how to read these curves. Thank you very much ! 🙂

One Answer

In general, the further the green line (cross-validation score) is from the red line (training score), the more the model is overfitting. Eventually, though, enough data will cure all overfitting: there will be so much data that the model can't possibly memorize all of it. That is why the lines converge (the model stops memorizing, so the red line goes down; it starts generalising, so the green line goes up). Some models need more data to learn than others, however, and so, as you can see, the LogisticRegression model reaches its best performance much faster than, for example, the SVC.
An interesting case is the KNN, whose red line doesn't go down but rather up. I'm pretty sure the reason has to do with how KNN works: it classifies new instances by comparing them with the instances it already knows. Thus, KNN doesn't really memorize, and new instances to compare against will never hinder its performance on the training set (red line). It too will eventually converge, however, with the two lines meeting.
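To make the discussion concrete, here is a minimal sketch of how curves like these can be generated with scikit-learn's learning_curve utility. The synthetic dataset (make_classification) and the hyperparameter values are illustrative assumptions, not taken from your post; substitute your own undersampled data and tuned models.

```python
# Sketch: plot learning curves (training vs. cross-validation accuracy)
# for the four classifiers discussed above. Dataset and parameters are
# illustrative assumptions, not the original poster's setup.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Hypothetical dataset standing in for the undersampled data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (name, model) in zip(axes.ravel(), models.items()):
    # learning_curve refits the model on increasing fractions of the
    # training data and cross-validates at each training-set size.
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring="accuracy",
        train_sizes=np.linspace(0.1, 1.0, 8),
    )
    ax.plot(sizes, train_scores.mean(axis=1), "r-o", label="training score")
    ax.plot(sizes, val_scores.mean(axis=1), "g-o", label="cross-validation score")
    ax.set_title(name)
    ax.set_xlabel("training set size")
    ax.set_ylabel("accuracy")
    ax.legend()
fig.tight_layout()
plt.show()
```

Reading the output the way described above: a large, persistent gap between the red and green curves signals overfitting, while curves that have already met suggest that more data alone won't improve that model.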

Correct answer by MartinM on July 26, 2021
