When developing machine learning models, is the size of each class in the test set important?

Question

I am thinking about the prospective application of a trained classifier in a real-world context. We know that when we over- or under-sample to balance our dataset, we never touch the test set, because we want it to keep the real behaviour of the data. What I do not understand is the role of the test set's distribution in a classifier's performance.
Let's say I have a model that labels an email as spam or non-spam. If I launch this model in my email service, then in a specific time window all the emails that my classifier receives might be non-spam, even though the model was trained on a 50/50 distribution of the two categories. My question is: does this difference in the distribution during the prospective application change the performance of the model? For example, if my web service receives 5 spam and 5 non-spam emails in that time window, should I get a more accurate classification? Based on my understanding, the answer should be no. Still, I see people everywhere talking about the importance of the test distribution and its role in the performance and accuracy of predictive models.
Thank you.

Noah Weber · Answer

It depends on whether you are going to do online learning.
If you do online (incremental) learning, then the test set distribution will make a difference, for example because of catastrophic forgetting in neural networks.
If you are making batch predictions, then the test set distribution makes no difference. The model cannot tell the difference, since it does not change its state.
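To make the batch-prediction case concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the `predict` threshold stands in for a frozen model, and the `spam` and `ham` lists are toy (score, label) pairs, not real data. It shows that a model that does not change its state classifies each class equally well whatever the class mix is. It also shows one reason people still talk about the test distribution: an aggregate number such as accuracy is a weighted average of the per-class rates, so it moves with the mix even though the model itself behaves the same.

```python
def predict(score):
    """Hypothetical frozen model: a fixed threshold, never updated."""
    return 1 if score > 0.5 else 0  # 1 = spam, 0 = non-spam


def evaluate(examples):
    """Return (accuracy, spam_recall, ham_recall) for (score, label) pairs."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for score, label in examples:
        total[label] += 1
        if predict(score) == label:
            correct[label] += 1
    accuracy = (correct[0] + correct[1]) / (total[0] + total[1])
    return accuracy, correct[1] / total[1], correct[0] / total[0]


# Toy pools: the model catches 8 of 10 spam and 19 of 20 non-spam emails.
spam = [(0.9, 1)] * 8 + [(0.3, 1)] * 2
ham = [(0.1, 0)] * 19 + [(0.7, 0)] * 1

balanced = spam * 2 + ham  # 20 spam, 20 non-spam
skewed = spam + ham * 5    # 10 spam, 100 non-spam

acc_b, spam_rec_b, ham_rec_b = evaluate(balanced)
acc_s, spam_rec_s, ham_rec_s = evaluate(skewed)

print("balanced:", acc_b, spam_rec_b, ham_rec_b)
print("skewed:  ", acc_s, spam_rec_s, ham_rec_s)
```

In this toy setup the per-class recall is identical on both test sets, while accuracy is higher on the skewed set only because that set contains more of the class the model happens to handle better.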

