What is meant by outliers in a text data set, and how can they be detected?

Question

I know that outliers are points that are present in the data but behave very differently from the remaining data points. Today, while learning about naive Bayes, I was told that naive Bayes can be affected by outliers. But which points in a text data set are termed outliers, and how do we identify them?

nag · Answer

In my view, words that do not appear in the training data can be considered outliers, because in naive Bayes an unseen word gets a likelihood of zero, which (without smoothing) drives the probability of the whole document to zero.

Also, I think words that are too frequent or too rare in the corpus can be considered outliers, as they affect the model.
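
To make both points concrete, here is a minimal sketch in plain Python. The toy corpus, the `word_likelihood` helper, and the frequency thresholds are assumptions chosen for illustration, not part of any library. It shows an unseen word getting a likelihood of zero, how additive (Laplace) smoothing avoids that, and how document frequency can flag very common and very rare words.

```python
from collections import Counter

# Toy training documents for a single class (hypothetical data).
train_docs = [
    "the phone has a great camera",
    "the battery life of the phone is great",
    "the screen is sharp and the camera is fast",
]

tokens = [doc.split() for doc in train_docs]
word_counts = Counter(w for doc in tokens for w in doc)
total_words = sum(word_counts.values())
vocab_size = len(word_counts)


def word_likelihood(word, alpha=0.0):
    """Estimate P(word | class); alpha=0 means no smoothing."""
    # The "+ 1" reserves one extra vocabulary slot for unseen words.
    return (word_counts[word] + alpha) / (total_words + alpha * (vocab_size + 1))


# An unseen word has likelihood zero without smoothing,
# and a small positive likelihood with Laplace smoothing (alpha=1).
unsmoothed = word_likelihood("blockchain")
smoothed = word_likelihood("blockchain", alpha=1.0)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(w for doc in tokens for w in set(doc))
n_docs = len(tokens)

# Assumed thresholds: "too frequent" = in more than 90% of documents,
# "too rare" = in exactly one document.
too_frequent = sorted(w for w, df in doc_freq.items() if df / n_docs > 0.9)
too_rare = sorted(w for w, df in doc_freq.items() if df == 1)

print(unsmoothed, smoothed)
print("too frequent:", too_frequent)
print("too rare:", too_rare)
```

In practice the thresholds depend on the corpus size, so treat the values here as placeholders to tune.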

Darshan Jain · Answer

I define an outlier in any of the following ways:

It can be a wrong data entry (e.g. a human typing error).
It can be an entry whose values are not relevant (e.g. a "total" row calculated as the sum of the rows above it. Such an entry can be misleading, so it should be removed).
It can be an entry in which all or most fields are blank (e.g. a row where every field is empty. Such a row may contribute nothing to the analysis).
It can be an extreme value that falls far outside the range of the other data (e.g. when working with human ages, anyone recorded as older than, say, 120 years is an extreme case that can be ignored, depending on the goal of the analysis).
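
The rules above can be sketched as simple checks on each row. This is only an illustration in plain Python: the sample rows, the `flag_row` helper, the field names, and the thresholds (`max_age=120`, `blank_fraction=0.5`) are all assumptions, and a real data set would need its own rules.

```python
# Hypothetical sample rows, including three suspicious ones.
rows = [
    {"name": "Asha", "age": 34, "city": "Pune"},
    {"name": "Ravi", "age": 29, "city": "Delhi"},
    {"name": None, "age": None, "city": ""},       # all fields blank
    {"name": "Meera", "age": 340, "city": "Goa"},  # likely a typing error
    {"name": "Total", "age": 403, "city": None},   # derived "total" row
]


def is_blank(value):
    """Treat None and empty/whitespace-only strings as blank."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def flag_row(row, max_age=120, blank_fraction=0.5):
    """Return the reasons a row looks suspicious (empty list if none)."""
    reasons = []
    blanks = sum(is_blank(v) for v in row.values())
    if blanks / len(row) > blank_fraction:
        reasons.append("mostly blank")
    name = row.get("name")
    if isinstance(name, str) and name.strip().lower() == "total":
        reasons.append("derived total row")
    age = row.get("age")
    if isinstance(age, (int, float)) and not 0 <= age <= max_age:
        reasons.append("age out of range")
    return reasons


flags = [flag_row(r) for r in rows]
for row, reasons in zip(rows, flags):
    print(row, "->", reasons or "ok")
```

Returning reasons instead of dropping rows immediately keeps the decision with the analyst, which matches the point that removal depends on the goal of the analysis.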

Allohvk · Answer

Since most models are now built on pre-trained embeddings, the problem of outliers in textual data is not that prominent. This is because the embeddings are trained on millions of words and sentences, so any outliers have a negligible effect.
Coming to the specific question, outliers in textual data could mean many things. For example, suppose you are collecting all news articles related to 'tech'. If there is a 'health' article in that corpus, then it is an outlier. Credit card fraud detection is another area where models are trained to detect outliers in textual data.
The typical way to identify these outliers is via clustering, though the exact techniques vary slightly from paper to paper.
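
As a rough illustration of the 'tech' versus 'health' example, the sketch below uses a simplified stand-in for clustering: it treats the whole corpus as one cluster, represents each document as raw word counts, and flags the document least similar to the cluster centroid. The sample headlines and helper names are assumptions made up for this example. A real pipeline would more likely use TF-IDF or embedding vectors with a proper clustering algorithm.

```python
import math
from collections import Counter

# Hypothetical corpus: four 'tech' headlines and one 'health' headline.
docs = [
    "new smartphone chip boosts battery and camera performance",
    "software update improves smartphone camera and battery life",
    "chip maker unveils faster processor for laptop and smartphone",
    "laptop software update fixes smartphone battery bug",
    "doctors recommend daily exercise and a balanced diet for heart health",
]


def to_vector(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = [to_vector(d) for d in docs]

# Centroid of the single cluster. Summing instead of averaging is fine
# here because cosine similarity ignores vector length.
centroid = Counter()
for v in vectors:
    centroid.update(v)

scores = [cosine(v, centroid) for v in vectors]

# The document with the lowest similarity to the centroid is the candidate outlier.
outlier_index = min(range(len(docs)), key=lambda i: scores[i])
print("candidate outlier:", docs[outlier_index])
```

With several topics in the corpus, the same idea applies per cluster: assign each document to its nearest centroid and flag the ones that are far from every centroid.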
