How to measure augmented data quality

Data Science Asked by Mikhail_Sam on September 6, 2020

I am working on an NLP binary classification task (although the question applies to any ML task that uses augmentation) and I use augmentation to create additional data.
I already have a trained model and a new, small dataset (the model was trained on a different one).
So I create an additional dataset from the small one using augmentation.

I have several functions/libraries/approaches for augmenting data.
My question is: how can I tell whether the new (augmented) data is good or not at the stage of creating it (i.e. without retraining the model)?

At the moment I have the following idea:

Take the examples from my small dataset on which the model fails (false negatives/false positives) ->
Augment them -> Feed the augmented examples to the model -> Look at the scores.
If the model still fails on them, the augmented data is good enough.
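
To make the idea concrete, here is a minimal sketch of that check in plain Python. It is only an illustration under stated assumptions: `predict` (returns the positive-class score for a text) and `augment` (returns a list of augmented variants of a text) are hypothetical stand-ins for my real model and augmentation functions, and the 0.5 threshold is arbitrary.

```python
from typing import Callable, List, Sequence, Tuple


def still_failing_rate(
    texts: Sequence[str],
    labels: Sequence[int],
    predict: Callable[[str], float],        # hypothetical: positive-class score in [0, 1]
    augment: Callable[[str], List[str]],    # hypothetical: augmented variants of one text
    threshold: float = 0.5,                 # assumed decision threshold
) -> Tuple[float, int]:
    """Return (share of augmented hard examples the model still gets wrong,
    number of original hard examples)."""

    def is_wrong(text: str, label: int) -> bool:
        return (predict(text) >= threshold) != bool(label)

    # Step 1: keep only the originals the model misclassifies (FP and FN).
    hard = [(t, y) for t, y in zip(texts, labels) if is_wrong(t, y)]

    # Steps 2-3: augment each hard example and score the variants,
    # assuming the augmentation keeps the original label.
    total = 0
    failed = 0
    for text, label in hard:
        for variant in augment(text):
            total += 1
            if is_wrong(variant, label):
                failed += 1

    rate = failed / total if total else float("nan")
    return rate, len(hard)


# Toy stand-ins, purely for illustration.
def toy_predict(text: str) -> float:
    return 0.9 if "good" in text.lower() else 0.1


def toy_augment(text: str) -> List[str]:
    return [text.replace("good", "great"), text + " indeed"]


texts = ["not good at all", "good film", "boring plot"]
labels = [0, 1, 0]
rate, n_hard = still_failing_rate(texts, labels, toy_predict, toy_augment)
print(f"{n_hard} hard example(s), still failing on {rate:.0%} of their variants")
```

Comparing this rate across augmentation functions would be the "look at the scores" step; the sketch assumes each augmentation preserves the label, which is exactly the part I cannot verify automatically.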

But I'm not sure this is a reliable approach at all.

Are there any established metrics for this?
Or can someone suggest a way to do this correctly?

data augmentation dataset metric nlp
