Asked by Jake Morris on January 9, 2021
I have a multi-class problem that I am building a classifier for. I have N total data points I would like to predict on. If I instead predict on n < N data points and get an accuracy score, is there a way I can say (with some degree of confidence) what I think the same model’s accuracy score would be on the remaining data points?
Can somebody point me to an article discussing this, or suggest a formula to research?
Usually when working with classification problems, one tries to have 3 subsets of data:

- a training set, used to fit the model;
- a validation set, used to tune hyperparameters and compare models;
- a test set, held out and used only for the final performance estimate.
In order for this to work properly, it is important that all the subsets are representative of the available data. For example, that the proportion of each class is approximately the same across the subsets.
In light of this widely used process, we can see that most algorithms are tuned and tested on a fraction of the available data. As long as the test set is representative in the same way as the training and validation sets (similar class proportions) and has not been used at all during the training/tuning of the model, there is no reason why the performance score measured on the n test samples would not generalize to the full N data points.
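A minimal sketch of such stratified splitting, assuming scikit-learn; the `make_classification` data here is a toy stand-in for your real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy multi-class data standing in for the real problem.
X, y = make_classification(n_samples=1000, n_classes=3,
                           n_informative=5, random_state=0)

# Hold out a test set; stratify=y keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Split the remainder into training and validation sets the same way.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)
# Result: roughly 60% train / 20% validation / 20% test.
```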
Answered by Eskapp on January 9, 2021
Use cross-validation. Split the data into K subsets (folds), then train and test K times, using a different fold for validation each time and training on the rest. The average cross-validation score is generally a better estimate of the model's performance on unseen data than a single score from the standard 80/10/10 split you'd use when training, validating, and testing your final model.
Many machine learning libraries, such as Python's scikit-learn, have a module for this.
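For example, a minimal sketch with scikit-learn's `cross_val_score` (the logistic-regression classifier and the `make_classification` toy data are assumptions standing in for your own model and data); the spread of the per-fold scores also gives a rough sense of how much the accuracy estimate varies:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy multi-class data standing in for the real problem.
X, y = make_classification(n_samples=1000, n_classes=3,
                           n_informative=5, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: the model is trained and scored 5 times,
# with a different fold held out for evaluation each time.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```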
Answered by Jay Speidell on January 9, 2021