TransWikia.com

What is meant by outliers in a text data set? How do we detect them?

Data Science Asked by Boggavarapu Ram Saran Sai Srin on February 15, 2021

I know that outliers are points that are present in the data but whose behaviour varies a lot from the remaining data points. Today, while learning about naive Bayes, I was told that naive Bayes can be affected by outliers. But which points in a text data set are termed outliers, and how do we identify them?

3 Answers

In my view, words that are not seen in the training data can be considered outliers, because in naive Bayes such a word gets a probability of zero, which drives the whole product of word probabilities to zero (unless smoothing is applied).
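To make the zero-probability point concrete, here is a minimal sketch, assuming a toy two-class corpus that I made up. The helper `word_likelihood` and its `alpha` parameter are hypothetical names for illustration, not any library's API. With `alpha=0` a word never seen with a class gets likelihood zero; add-one (Laplace) smoothing, which is a standard remedy and not something stated in the answer above, keeps it small but non-zero.

```python
from collections import Counter

def word_likelihood(word, class_docs, vocab, alpha=0.0):
    """Estimate P(word | class) from the tokenized documents of one class.

    alpha=0 gives the unsmoothed estimate; alpha=1 is add-one (Laplace) smoothing.
    """
    counts = Counter(tok for doc in class_docs for tok in doc)
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocab))

# Toy training data (invented for illustration only).
spam_docs = [["win", "money", "now"], ["win", "prize"]]
ham_docs = [["hello", "friend"], ["hello", "now"]]
vocab = {tok for doc in spam_docs + ham_docs for tok in doc}

# "hello" never occurs in a spam document, so the unsmoothed estimate is zero
# and would wipe out the whole product of word likelihoods for the spam class.
unsmoothed = word_likelihood("hello", spam_docs, vocab)
smoothed = word_likelihood("hello", spam_docs, vocab, alpha=1.0)
print(unsmoothed, smoothed)
```

In practice, words that appear in no training document at all are often simply dropped at prediction time, so the troublesome case is usually a word seen with one class but not another.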

Also, I think words that are too frequent or too rare in the corpus can be considered outliers, as they affect the model.
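One way to spot such words is by document frequency, i.e. the number of documents a word appears in. The following is only a sketch: the function name and the thresholds `min_df` and `max_df_ratio` are assumptions chosen for the toy corpus, and sensible values depend on your data.

```python
from collections import Counter

def flag_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Return (too_rare, too_frequent) word sets based on document frequency."""
    # Count each word once per document, however often it repeats inside it.
    df = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    too_rare = {w for w, c in df.items() if c < min_df}
    too_frequent = {w for w, c in df.items() if c / n_docs > max_df_ratio}
    return too_rare, too_frequent

# Toy corpus (invented for illustration only).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird flew over the zeppelin".split(),
]
rare, frequent = flag_by_document_frequency(docs)
print(sorted(rare))
print(sorted(frequent))
```

Whether flagged words should be removed, down-weighted, or kept is a modelling decision; the sketch only identifies candidates.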

Answered by nag on February 15, 2021

I define an outlier in the following ways.

  1. It can be a wrong data entry (e.g. a human typing error).
  2. It can be data whose values are not relevant (e.g. a total entry that is calculated as the sum of the above columns. Such data can be misleading at times, so it should be removed).
  3. It can be a data entry in which all or most fields are blank (e.g. a row where all fields are blank. Such a row may not contribute anything to the analysis).
  4. It may be an extreme value that falls far outside the range of the other data (e.g. when analysing the age of humans, anyone with an age above, say, 120 years is an extreme case that can be ignored, depending on the goal of the analysis).
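Points 3 and 4 of the list above can be checked mechanically. Here is a rough sketch, assuming rows are plain dictionaries; the field names, the 120-year limit taken from the example above, and the 50% blank threshold are all assumptions for illustration.

```python
def flag_suspect_rows(rows, max_age=120, max_blank_ratio=0.5):
    """Return (index, reason) pairs for rows that look like outliers."""
    flagged = []
    for i, row in enumerate(rows):
        blanks = sum(1 for v in row.values() if v in (None, ""))
        age = row.get("age")
        if row and blanks / len(row) > max_blank_ratio:
            flagged.append((i, "mostly blank"))
        elif isinstance(age, (int, float)) and not (0 <= age <= max_age):
            flagged.append((i, "age out of range"))
    return flagged

# Toy records (invented for illustration only).
rows = [
    {"name": "Asha", "age": 34, "city": "Pune"},
    {"name": "", "age": None, "city": ""},
    {"name": "Ravi", "age": 340, "city": "Delhi"},  # possibly a typing error
]
flagged = flag_suspect_rows(rows)
print(flagged)
```

Flagging rather than deleting keeps the decision with the analyst, which matters because an extreme value may be a typing error (point 1) or a genuine rare case.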

Answered by Darshan Jain on February 15, 2021

Since most models are built using pre-trained embeddings, the problem of outliers in textual data is not that prominent. This is because the training is done on millions of words and sentences, so outliers, if any, have little effect.

Coming to the specific problem statement, outliers in textual data could mean many things. For example, suppose you are collating all news articles related to 'tech'. If there is a 'health' article in that corpus, then it is an outlier. Credit card fraud detection is another area where we train models to detect outliers in textual data.

The typical way to identify these outliers is via clustering. The techniques vary slightly from paper to paper.
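As a very simplified illustration of the idea, the sketch below treats the whole corpus as a single cluster, represents each document as a bag-of-words count vector, and reports the document least similar (by cosine similarity) to the cluster centroid. This is not a full clustering algorithm and not the method of any particular paper; the function names and the toy 'tech' and 'health' headlines are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def least_typical(docs):
    """Return the index of the document least similar to the corpus centroid."""
    vectors = [Counter(doc) for doc in docs]
    # The summed vector points in the same direction as the mean vector,
    # and cosine similarity ignores scale, so the sum serves as the centroid.
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    sims = [cosine(vec, centroid) for vec in vectors]
    return min(range(len(docs)), key=sims.__getitem__), sims

# Toy corpus (invented): three 'tech' headlines and one 'health' headline.
docs = [
    "new phone chip boosts software speed".split(),
    "software update improves phone battery".split(),
    "chip maker ships faster phone software".split(),
    "doctors recommend vitamin diet for heart health".split(),
]
outlier_index, sims = least_typical(docs)
print(outlier_index, [round(s, 3) for s in sims])
```

With real data you would more likely use TF-IDF or embedding vectors and several clusters, then flag points that are far from every cluster centre.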

Answered by Allohvk on February 15, 2021
