Data Science Asked on July 17, 2021
I have developed a pipeline for text data preprocessing with different cleanup techniques such as stemming, lemmatization, stop word removal, etc. Now the business team is asking us to quantify the quality of the preprocessing steps (or of the text data they produce). How can we develop metrics to evaluate the preprocessing quality of text data?
Evaluating any task requires defining the task formally, so that what counts as a correct output can be specified as objectively as possible. For example, a Machine Translation system produces a good translation if the output has the same meaning as the input sentence and is grammatically correct in the target language.
Assuming that this preprocessing task is formally defined, the evaluation should measure how "correctly preprocessed" the output is.
Usually one would build a test set, manually annotate the correct output for each input, and then compare the system output against this gold standard.
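As a rough illustration of the gold-standard comparison, here is a minimal Python sketch. Everything in it is an assumption made for the example: `toy_preprocess` is a hypothetical stand-in for the real pipeline, the two test items and their gold tokens are invented, and exact match plus token-level precision/recall/F1 are just one reasonable choice of metrics, not the only one.

```python
from collections import Counter


def toy_preprocess(text):
    """Hypothetical stand-in for the real pipeline:
    lowercase, split on whitespace, drop a few stop words."""
    stop_words = {"the", "a", "is", "are", "on"}
    return [tok for tok in text.lower().split() if tok not in stop_words]


def evaluate_against_gold(preprocess, test_set):
    """Compare pipeline output with manually annotated gold tokens.

    test_set is a list of (raw_text, gold_tokens) pairs.
    Returns exact-match rate and micro-averaged token precision/recall/F1.
    """
    exact = overlap = produced = expected = 0
    for raw_text, gold_tokens in test_set:
        system_tokens = preprocess(raw_text)
        # Whole-output agreement (order and content identical).
        exact += int(system_tokens == gold_tokens)
        # Multiset intersection: tokens present in both outputs.
        common = Counter(system_tokens) & Counter(gold_tokens)
        overlap += sum(common.values())
        produced += len(system_tokens)
        expected += len(gold_tokens)

    precision = overlap / produced if produced else 0.0
    recall = overlap / expected if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    exact_match = exact / len(test_set) if test_set else 0.0
    return {"exact_match": exact_match, "precision": precision,
            "recall": recall, "f1": f1}


# Invented gold standard: the annotator expects lemmas, which the toy
# pipeline does not produce, so the first item is only partially correct.
test_set = [
    ("The cats are sleeping on the mat", ["cat", "sleep", "mat"]),
    ("A dog is barking", ["dog", "barking"]),
]
scores = evaluate_against_gold(toy_preprocess, test_set)
print(scores)
```

The scores are only as meaningful as the gold annotations, which in turn depend on how the "correct" preprocessed output is defined for the target task.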
However, "preprocessing" is generally not considered a task in its own right, because by definition it is a step towards another task. Importantly, the preprocessing steps depend on that other task; they are not always the same. For example, stop word removal makes sense only for tasks based on distributional semantics, i.e. tasks related to the topic of the text. Preprocessing may also include steps which depend on the volume of data.
Answered by Erwan on July 17, 2021