Data Science Asked on July 26, 2021
I’ve trained 4 classifiers on an undersampled dataset.
I plotted the learning curve for each classifier and got the following results [learning-curve plots for the four classifiers, not shown]:
For the logistic regression, both curves seem to converge, so adding more data will stop helping at some point.
For the SVC I have no idea (other than that adding more data seems good!).
For KNN: adding more data increases both the training and the validation accuracy.
For Random Forest: I have no idea.
I would love to understand how to read these curves. Thank you very much! 🙂
In general, the further the green line is from the red line, the more the model is overfitting. However, eventually enough data will cure all overfitting: there will be so much data that the model can't possibly memorize all of it. That's why the lines converge (the model stops memorizing, so the red line comes down; it starts generalising, so the green line goes up). Some models need more data to learn than others, though, and so, as you can see, the LogisticRegression model reaches its best performance much faster than, for example, the SVC.
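As a minimal sketch of how such curves are produced (assuming scikit-learn; `make_classification` here is a stand-in for the asker's undersampled dataset), you could use `sklearn.model_selection.learning_curve`:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in dataset; the asker's own undersampled data would go here.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# learning_curve refits the model on increasingly large training subsets
# and cross-validates each fit.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0], cv=5, scoring="accuracy",
)

# Red line: training score; green line: cross-validation score.
plt.plot(train_sizes, train_scores.mean(axis=1), "r-o", label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), "g-o", label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

The gap between the two plotted lines is the overfitting gap described above; watching it shrink as `train_sizes` grows is exactly the convergence the answer refers to.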
An interesting case is the KNN, whose red line doesn't go down but rather up. I'm pretty sure the reason has to do with how the KNN works: it classifies new instances by comparing them with the instances it already knows. Thus, the KNN doesn't really memorize; new instances it can compare with will never hinder its performance on the training set (the red line). However, it too will eventually converge, with the two lines meeting.
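A quick way to check this claim (again a sketch on a synthetic dataset, not the asker's data) is to score a `KNeighborsClassifier` on its own training set at growing sizes; with k > 1 the training accuracy is not automatically 100% and tends to rise as more neighbours become available:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# With k > 1, a KNN does not trivially score 100% on its own training set,
# so the training accuracy (the red line) can climb as data is added.
for n in [100, 250, 500, 1000, 2000]:
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[:n], y[:n])
    print(n, knn.score(X[:n], y[:n]))
```

The exact numbers depend on the dataset, but the upward trend of the training score is the behaviour the answer describes. (With k = 1 the training score would be a flat 100%, since every point is its own nearest neighbour.)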
Correct answer by MartinM on July 26, 2021