Classification with noisy labels where the noise is structured, not random

Cross Validated Asked on November 14, 2021

I am building a classification model on mislabeled training data: roughly 70% of the training records are labeled correctly and roughly 30% are labeled incorrectly. Knowing this, how can I quantify the error rate of my model? For example, if I get 85% accuracy on the test set, how much of that 85% comes from the 70% of records that are actually labeled correctly?
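To make the arithmetic concrete, here is a minimal sketch of the worst-case and best-case bounds implied by those two numbers alone. It assumes the test set carries the same ~30% label noise as the training set (the question does not state this), and it makes no assumption about how the noise relates to the predictors. The function names are hypothetical, chosen only for illustration.

```python
def true_accuracy_bounds(observed_accuracy, label_correct_rate):
    """Bounds on accuracy against the TRUE labels.

    observed_accuracy:  share of test records where prediction == observed label
    label_correct_rate: share of test records whose observed label is correct
                        (assumed here to match the ~70% quoted for training data)
    """
    a, c = observed_accuracy, label_correct_rate
    if not (0.0 <= a <= 1.0 and 0.0 <= c <= 1.0):
        raise ValueError("both rates must lie in [0, 1]")

    # x = share of records where the prediction matches a CORRECT observed label.
    # Without further information, x can be anywhere in [x_min, x_max].
    x_min = max(0.0, a + c - 1.0)
    x_max = min(a, c)

    # Worst case: as many matches as possible are matches with wrong labels,
    # and no mismatch on a wrongly labeled record is actually right.
    lower = x_min

    # Best case: as many matches as possible are on correct labels, and every
    # mismatch on a wrongly labeled record is in fact a correct prediction.
    upper = x_max + ((1.0 - c) - (a - x_max))
    return lower, upper


def share_of_matches_on_correct_labels(observed_accuracy, label_correct_rate):
    """Range for the fraction of 'accurate' predictions that hit correct labels."""
    a, c = observed_accuracy, label_correct_rate
    if a <= 0.0:
        raise ValueError("observed_accuracy must be positive")
    return max(0.0, a + c - 1.0) / a, min(a, c) / a


if __name__ == "__main__":
    lo, hi = true_accuracy_bounds(0.85, 0.70)
    print(f"true accuracy between {lo:.2f} and {hi:.2f}")
    s_lo, s_hi = share_of_matches_on_correct_labels(0.85, 0.70)
    print(f"share of matches on correct labels between {s_lo:.2f} and {s_hi:.2f}")
```

With 85% observed accuracy and 70% correct labels, at least 15 percentage points of the matches must be matches with wrong labels, so the true accuracy can only be pinned down to roughly 55% to 85% without more information about the noise.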

I should add that the mislabeling is not completely random either. There is certainly a relationship between my predictors and whether or not a label is correct. I have a few hundred possible labels and around 1 million records. The data are survey responses describing occupations, so common mislabelings involve write-ins containing words such as “Office manager”, which could land in any number of codes.

Is there any literature on this? Maybe some sort of confidence interval I can build for the error rate?

classification errors in training generalization error references
