
Measuring performance of different classifiers with different sample sizes

Asked by Dave Challis on Data Science, August 8, 2020

I’m currently using several different classifiers on various entities extracted from text, and using precision/recall as a summary of how well each separate classifier performs across a given dataset.

I’m wondering if there’s a meaningful way of comparing the performance of these classifiers in a similar way, but which also takes into account the total number of each entity in the test data being classified?

Currently, I’m using precision/recall as a measure of performance, so might have something like:

                    Precision Recall
Person classifier   65%       40%
Company classifier  98%       90%
Cheese classifier   10%       50%
Egg classifier      100%      100%

However, the dataset I’m running these on might contain 100k people, 5k companies, 500 cheeses, and 1 egg.

So is there a summary statistic I can add to the above table which also takes into account the total number of each item? Or is there some way of measuring the fact that e.g. 100% prec/rec on the Egg classifier might not be meaningful with only 1 data item?

Let’s say we had hundreds of such classifiers; I guess I’m looking for a good way to answer questions like “Which classifiers are underperforming? Which classifiers lack sufficient test data to tell whether they’re underperforming?”.

3 Answers

In my opinion, it is difficult to compare performance when there is such a large difference in sample size. The linked Wikipedia article describes several strategies.

The approach I suggest is one related to the variance. For instance, consider the egg classifier (100%) and the person classifier (65%). The smallest non-zero error you can make with the former is 100% (a single mistake on the single egg), whereas the smallest non-zero error you can make with the latter is about 1e-5 (a single mistake among 100,000 people).

So one way to compare classifiers is to keep the rule of three in mind: if no errors are observed in n samples, the 95% upper bound on the true error rate is roughly 3/n. This lets you compare the measured performance together with its variability.
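As a minimal sketch (in Python; the function name is mine), here is how the rule of three turns sample size into an upper bound on the error rate a perfect-looking score could be hiding:

    # Rule of three: if zero errors are observed in n samples, an approximate
    # 95% upper confidence bound on the true error rate is 3 / n.
    def rule_of_three_upper_bound(n_samples):
        return 3.0 / n_samples

    # One egg: a perfect score still allows an error rate of essentially 100%.
    print(rule_of_three_upper_bound(1))        # 3.0 (bound is vacuous)
    # 100k people: a perfect score would pin the error rate down tightly.
    print(rule_of_three_upper_bound(100_000))  # 3e-05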

Another possibility is the F-measure, which combines precision and recall, although it is largely independent of the sample size.
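For completeness, a quick sketch of the F-measure (F1) using the figures from the question; note that it still knows nothing about the number of test items:

    def f1_score(precision, recall):
        # Harmonic mean of precision and recall.
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f1_score(0.65, 0.40))  # Person classifier -> ~0.495
    print(f1_score(1.00, 1.00))  # Egg classifier    -> 1.0, despite one sample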

Answered by adesantos on August 8, 2020

You need to look at the confidence interval of the statistic. This measures how much uncertainty there is in the statistic, which is largely a function of sample size.
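For illustration, assuming precision or recall can be treated as a binomial proportion over the test items, a Wilson score interval makes the effect of sample size explicit (a sketch, not taken from the answer):

    import math

    def wilson_interval(successes, n, z=1.96):
        # Wilson score 95% confidence interval for a proportion,
        # e.g. recall estimated from n relevant test items.
        if n == 0:
            return (0.0, 1.0)
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return (max(0.0, center - half), min(1.0, center + half))

    print(wilson_interval(1, 1))             # ~(0.21, 1.00): 100% on 1 egg, very wide
    print(wilson_interval(40_000, 100_000))  # ~(0.397, 0.403): 40% on 100k people, tight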

Answered by Christopher Louden on August 8, 2020

The number of data points in the class is sometimes referred to as the support of the classifier. It tells you how much you can trust your result, much as a p-value allows you to trust or distrust the outcome of a test.

One approach you can use is to compute several classifier performance measures, not only precision and recall but also true positive rate, false positive rate, specificity, sensitivity, positive likelihood, negative likelihood, etc., and see whether they are consistent with one another. If one of the measures maxes out (100%) and the others do not, in my experience that often indicates something went wrong (e.g. poor support, a trivial classifier, a biased classifier, etc.). See this for a list of classifier performance measures.
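As a rough sketch of that approach (the counts below are hypothetical, chosen only so that precision and recall roughly match the 65%/40% person classifier from the question):

    def confusion_measures(tp, fp, tn, fn):
        # Derive several performance measures from confusion-matrix counts.
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")  # recall / TPR
        specificity = tn / (tn + fp) if tn + fp else float("nan")
        precision = tp / (tp + fp) if tp + fp else float("nan")
        fpr = 1 - specificity
        return {
            "precision": precision,
            "recall (sensitivity, TPR)": sensitivity,
            "specificity": specificity,
            "false positive rate": fpr,
            "positive likelihood ratio": sensitivity / fpr if fpr else float("inf"),
            "negative likelihood ratio": (1 - sensitivity) / specificity if specificity else float("inf"),
            "support (actual positives)": tp + fn,
        }

    print(confusion_measures(tp=400, fp=215, tn=9000, fn=600))

If several of these measures agree with one another and the support is reasonable, the headline precision/recall figures are much more believable.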

Answered by damienfrancois on August 8, 2020
