How to estimate the accuracy on a large dataset?

Question

I have a deep learning model (handed over from a former colleague).
For some reason, the train/dev sets are missing.

I want to classify my dataset into 100 categories.
The dataset is extremely imbalanced.
It contains tens of millions of records.

First, I ran the model and got predictions for the whole dataset.

Then I sampled 100 records per category (according to the prediction), which gave a test set of 10,000 records.

Next, I labeled the ground truth for each record in the test set, calculated precision, recall, and F1 for each category, and computed micro-F1 and macro-F1.

How can I estimate the accuracy or other metrics on the whole dataset? Is it correct to estimate it as the weighted sum of each category's precision, where the weight is that category's share of the predictions on the whole dataset?

Brian Spiering · Answer

Accuracy has a specific meaning in classification: the number of data points whose predicted labels exactly match the actual labels, divided by the total number of data points.
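As a minimal sketch of that definition, the function below counts exact matches between predicted and actual labels and divides by the number of data points. The function name `accuracy` and the example labels are hypothetical, chosen only for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of data points whose predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one labeled data point")
    # Count the positions where the two labels match exactly.
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(predicted)


# Hypothetical labels for five data points; four of the five match.
predicted = ["cat", "dog", "dog", "bird", "cat"]
actual = ["cat", "dog", "cat", "bird", "cat"]
print(accuracy(predicted, actual))
```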

To calculate accuracy, you need the actual label for each data point. Data points without actual labels cannot be used in the calculation.
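The weighted sum described in the question can be sketched using only the hand-labeled records plus the prediction counts on the full dataset. This is an illustration rather than a definitive method: it assumes the labeled records were drawn at random within each predicted category, and the names `estimate_accuracy`, `sample`, and `prediction_counts` are hypothetical. Under that assumption, each category's precision on the sample estimates the share of correct predictions within that category, so weighting by the category's share of all predictions gives an estimate of overall accuracy.

```python
from collections import Counter


def estimate_accuracy(sample, prediction_counts):
    """Estimate overall accuracy from a sample stratified by predicted label.

    sample: list of (predicted_label, true_label) pairs that were hand-labeled.
    prediction_counts: dict mapping each predicted label to the number of
        records in the full dataset that received that prediction.
    """
    total = sum(prediction_counts.values())
    labeled = Counter(pred for pred, _ in sample)
    correct = Counter(pred for pred, true in sample if pred == true)

    estimate = 0.0
    for label, count in prediction_counts.items():
        if labeled[label] == 0:
            raise ValueError(f"no labeled records for predicted label {label!r}")
        precision = correct[label] / labeled[label]  # per-category precision
        estimate += (count / total) * precision      # weight by prediction share
    return estimate


# Hypothetical example: two categories, four labeled records each.
sample = [
    ("a", "a"), ("a", "a"), ("a", "a"), ("a", "b"),  # precision 0.75
    ("b", "b"), ("b", "b"), ("b", "a"), ("b", "a"),  # precision 0.50
]
prediction_counts = {"a": 900, "b": 100}
print(estimate_accuracy(sample, prediction_counts))
```

Recall is a different matter: because the sample is stratified by prediction rather than by true label, per-category recall computed directly on the sample is not an unbiased estimate for the full dataset.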
