Data Science Asked by Boggavarapu Ram Saran Sai Srin on February 15, 2021
I know that outliers are present in data and that their behaviour differs a lot from the remaining data points. But today, while learning about naive Bayes, they mentioned that naive Bayes can be affected by outliers. Which points in a data set are termed outliers, and how do we identify them?
In my view, words that are not seen in the training data can be considered outliers, since in naive Bayes they drive the probability of the word to zero.
Also, I think words that are too frequent or too rare in the corpus can be considered outliers, as they affect the model.
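To make the zero-probability concern concrete, here is a minimal sketch, assuming a toy "spam" class and a six-word vocabulary invented for illustration. It compares the unsmoothed estimate of P(word | class) with additive (Laplace) smoothing, which is the usual remedy. The function name `word_likelihood` and the data are hypothetical.

```python
from collections import Counter

def word_likelihood(word, class_tokens, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace) smoothing.

    alpha=0.0 gives the raw relative frequency; alpha=1.0 is add-one smoothing.
    """
    counts = Counter(class_tokens)
    return (counts[word] + alpha) / (len(class_tokens) + alpha * len(vocabulary))

# Hypothetical toy data: tokens seen in the "spam" class and the full vocabulary.
spam_tokens = ["win", "money", "now", "win", "prize"]
vocabulary = {"win", "money", "now", "prize", "meeting", "report"}

# "meeting" never occurs in the spam class, so the raw estimate is exactly zero
# and would zero out the whole product of likelihoods for that class.
unsmoothed = word_likelihood("meeting", spam_tokens, vocabulary, alpha=0.0)

# With add-one smoothing the estimate becomes small but non-zero.
smoothed = word_likelihood("meeting", spam_tokens, vocabulary, alpha=1.0)

print(unsmoothed, smoothed)
```

Note that in this sketch "meeting" is in the vocabulary but absent from one class. Words that never appear anywhere in the training data are often simply ignored at prediction time instead.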
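One common way to handle very frequent and very rare words is to filter the vocabulary by document frequency. The sketch below is only an illustration under assumed thresholds (`min_df=2`, `max_df_ratio=0.8`) and a made-up four-document corpus; the helper name `filter_by_document_frequency` is hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Keep words that appear in at least min_df documents
    and in at most max_df_ratio of all documents."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    return {w for w, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}

# Hypothetical tokenized corpus.
docs = [
    ["the", "phone", "battery", "lasts"],
    ["the", "phone", "screen", "cracked"],
    ["the", "laptop", "battery", "died"],
    ["the", "laptop", "screen", "flickers"],
]

# "the" occurs in every document (too frequent); words such as "lasts"
# occur in only one document (too rare). Both groups are dropped.
kept = filter_by_document_frequency(docs)
print(sorted(kept))
```

The thresholds are a judgment call and usually need tuning per corpus.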
Answered by nag on February 15, 2021
I define an outlier in the following ways.
Answered by Darshan Jain on February 15, 2021
Since most models are built using pre-trained embeddings, the problem of outliers in textual data is not that prominent. This is because the training is done on millions of words/sentences, and outliers, if any, do not have an effect.
Coming to the specific problem statement, outliers in textual data could mean many things. For example, suppose you are collating all news articles related to 'tech'. If there is a 'health' article in that corpus, then it is an outlier. Credit card fraud detection is another area where we train models to detect outliers in textual data.
The typical way to identify these outliers is via clustering. The techniques vary slightly from paper to paper.
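As a rough illustration of the idea, here is a minimal sketch that treats the whole corpus as a single cluster and flags the document least similar to the cluster centroid. This is a deliberate simplification, not a specific method from any paper: it assumes plain bag-of-words counts and cosine similarity, whereas real pipelines typically use TF-IDF or embeddings with a proper clustering algorithm. The documents and function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid_similarities(docs):
    """Similarity of each document to the summed bag-of-words of the corpus.

    The sum is used instead of the mean because cosine similarity
    is unaffected by the scale of the vector.
    """
    vectors = [Counter(doc.lower().split()) for doc in docs]
    centroid = Counter()
    for v in vectors:
        centroid.update(v)
    return [cosine(v, centroid) for v in vectors]

# Hypothetical corpus: four 'tech' headlines and one 'health' headline.
docs = [
    "new phone chip boosts battery life",
    "startup releases phone software update",
    "chip maker unveils faster phone processor",
    "software update improves battery life",
    "doctors recommend daily exercise for heart health",
]

sims = centroid_similarities(docs)

# The document with the lowest similarity to the centroid is the outlier candidate.
outlier_index = min(range(len(docs)), key=sims.__getitem__)
print(outlier_index, docs[outlier_index])
```

With several topics in the corpus, the same idea applies per cluster: assign each document to its nearest centroid and flag those that are far from every centroid.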
Answered by Allohvk on February 15, 2021