TransWikia.com

When developing machine learning models, is the size of each class in the test set important?

Data Science Asked on February 12, 2021

I am thinking about the prospective application of a trained classifier in a real-world context. We know that when we do over/under-sampling to balance our dataset, we never touch the testing set as we want to keep our dataset’s real behaviour. But the part that I do not understand is the role of the test set’s distribution in a classifier’s performance.

Let’s say I have a model that labels an email as spam or non-spam. If I deploy this model in my email service, then in a specific time window all the emails my classifier receives might be non-spam, whereas the model was trained on a 50/50 split of the two categories. My question is: does this difference in distribution during the prospective application change the performance of the model? For example, if my web service receives 5 spam and 5 non-spam emails in that time window, should I expect a more accurate classification? Based on my understanding, the answer should be no. Still, I see people everywhere talking about the importance of the test distribution and its role in the performance and accuracy of predictive models.

Thank you.

One Answer

It depends on whether you are going to do online learning.

If you do online (incremental) learning, then the test-set distribution will make a difference, for example because of catastrophic forgetting in neural networks.
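To make the online case concrete, here is a minimal sketch, not a real spam filter. It assumes a toy one-feature logistic regression (the class `OnlineLogReg`, its starting weights, the learning rate, and the feature values are all made up for illustration) that keeps updating on whatever arrives. Feeding it a window of only non-spam emails pulls its spam probability for a clearly spammy input downwards, which is a linear-model analogue of the forgetting effect mentioned above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogReg:
    """Toy one-feature logistic regression updated one example at a time (hypothetical)."""

    def __init__(self, w, b, lr=0.5):
        self.w, self.b, self.lr = w, b, lr

    def predict_proba(self, x):
        # Probability that an email with feature value x is spam.
        return sigmoid(self.w * x + self.b)

    def partial_fit(self, x, y):
        # One SGD step on the log-loss; this mutates the model's state.
        error = self.predict_proba(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error


# Assumed starting point: a model whose decision boundary sits at x = 0.5.
model = OnlineLogReg(w=4.0, b=-2.0)
spam_x = 0.8  # a clearly "spammy" feature value in this toy setup
p_before = model.predict_proba(spam_x)

# A time window in which every incoming email is non-spam (label 0).
for _ in range(50):
    for x in (0.1, 0.2, 0.3, 0.4):
        model.partial_fit(x, 0)

p_after = model.predict_proba(spam_x)
print(f"P(spam | x=0.8) before: {p_before:.3f}, after: {p_after:.3f}")
```

Every update in the loop lowers both `w` and `b`, so the spam probability for the same input can only go down. In this setting the distribution of the incoming stream really does change the model.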

If you are making batch predictions, then the test-set distribution makes no difference. The model cannot tell the difference, since it does not change its state.
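A small sketch of the batch case, under made-up assumptions: `predict` is a stand-in for a frozen model (a fixed threshold on a hypothetical spam score), and the scores and labels are invented. Each email gets the same prediction whichever batch it arrives in. What can still move is an aggregate number such as accuracy, because it averages per-class results weighted by how many emails of each class are in the batch. That is probably why people stress the test distribution when reporting metrics.

```python
def predict(score, threshold=0.5):
    # Stand-in for a trained, frozen model: a fixed rule with no internal state.
    return 1 if score >= threshold else 0


def accuracy(batch):
    correct = sum(predict(score) == label for score, label in batch)
    return correct / len(batch)


# (score, label) pairs; label 1 = spam, 0 = non-spam. Values are illustrative only.
spam = [(0.9, 1), (0.8, 1), (0.7, 1), (0.4, 1), (0.3, 1)]   # 3 of 5 caught
ham = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.45, 0)]   # 5 of 5 correct

balanced = spam + ham

# The prediction for each email does not depend on the batch it is in.
preds_in_balanced = [predict(score) for score, _ in balanced]
preds_separately = [predict(score) for score, _ in spam] + [predict(score) for score, _ in ham]
same_predictions = preds_in_balanced == preds_separately

acc_balanced = accuracy(balanced)   # 5 spam + 5 non-spam
acc_only_ham = accuracy(ham)        # a window with only non-spam
acc_only_spam = accuracy(spam)      # a window with only spam

print(same_predictions, acc_balanced, acc_only_ham, acc_only_spam)
```

The model behaves identically in all three windows. Only the summary statistic changes with the class mix, since per-class hit rates (3/5 on spam, 5/5 on non-spam here) are combined in different proportions.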

Answered by Noah Weber on February 12, 2021
