Quality check for preprocessing of Text data

Question

I have developed a pipeline for text data preprocessing with different cleanup techniques such as stemming, lemmatization, stop word removal, etc. Now the business team is asking us to quantify the quality of the preprocessing steps (or of the text data they produce). How can we develop metrics to evaluate the preprocessing quality of text data?

Erwan · Answer

Evaluating any task starts with defining the task formally, so that what counts as a correct output can be determined as objectively as possible. For example, a Machine Translation system produces a good translation if the output has the same meaning as the input sentence and is grammatically correct in the target language.
Assuming that the preprocessing task is formally defined, the evaluation should measure how "correctly preprocessed" the output is:

are the stem and lemma always correct?
are the stop words and only the stop words removed?
Etc.

Usually one would build a test set, manually annotate the correct output for each input, and then compare the system output against this gold standard.
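
As a rough illustration of comparing against a gold standard, here is a minimal Python sketch. The function names, the toy sentence and the gold annotations are all made up for the example, and it assumes the system output is already aligned token by token with the gold annotation. It scores stems/lemmas by accuracy and stop word removal by precision and recall.

```python
from collections import Counter


def token_accuracy(predicted, gold):
    """Fraction of aligned positions where the predicted stem/lemma matches the gold one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def removal_precision_recall(original, kept, gold_stop_words):
    """Precision and recall of stop word removal for one tokenized text.

    original:        tokens before stop word removal
    kept:            tokens the system left in place
    gold_stop_words: set of tokens that should have been removed (hypothetical gold list)
    """
    # Multiset of tokens the system actually removed.
    removed = Counter(original) - Counter(kept)
    # Multiset of tokens that should have been removed according to the gold standard.
    gold_removed = Counter(t for t in original if t in gold_stop_words)
    # Removals that were both performed and expected.
    true_positives = sum((removed & gold_removed).values())

    n_removed = sum(removed.values())
    n_gold = sum(gold_removed.values())
    # Convention chosen for this sketch: an empty denominator scores 1.0.
    precision = true_positives / n_removed if n_removed else 1.0
    recall = true_positives / n_gold if n_gold else 1.0
    return precision, recall


# Toy example (invented data): one lemma is wrong, one stop word is missed.
system_lemmas = ["cat", "are", "run", "garden"]
gold_lemmas = ["cat", "be", "run", "garden"]
print(token_accuracy(system_lemmas, gold_lemmas))

original = ["the", "cats", "are", "running", "in", "the", "garden"]
kept = ["cats", "running", "in", "garden"]
print(removal_precision_recall(original, kept, {"the", "are", "in"}))
```

Precision answers "are only the stop words removed?" and recall answers "are all the stop words removed?". In practice the scores would be averaged over the whole test set rather than computed on a single sentence.
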
However, "preprocessing" is generally not considered a task by itself, because by definition it is a step toward another task. Importantly, the preprocessing steps depend on that other task and are not always the same. For example, stop word removal makes sense only for tasks based on distributional semantics, i.e. tasks related to the topic of the text. Preprocessing may also include steps which depend on the volume of data.
