How to remove irrelevant text data from a large dataset

Question

I am working on an ML project with data collected from a social media platform, and the topic of the data is supposed to be depression during Covid-19. However, when I read some of the retrieved data, I noticed that some texts (around 1-5%) mention Covid-related keywords but are not actually about the pandemic: they tell a life story (from age 5 to age 27) rather than describing how Covid affects the writer's life.
The data I am looking for are texts in which people describe how Covid makes their depression worse, and the like.
Is there a general way to clean out these irrelevant texts (or outliers) whose context is not Covid-related?
Or is it OK to keep them in the dataset, since they only account for 1-5%?

Abhishek Verma · Answer

You can use BERT to create vectors that capture the context of the whole tweet. Once you have done that, try clustering (K-Means or GMM). You can then inspect the clusters found and separate out the unwanted data.
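
As a rough illustration of the embed-then-cluster idea, here is a minimal sketch. It assumes `scikit-learn` and `numpy` are installed, and (for the embedding step only) the `sentence-transformers` package with the `all-MiniLM-L6-v2` model, which is a BERT-style sentence encoder picked purely as an example and not something named in the question. The helper names `embed_texts`, `cluster_embeddings` and `sample_per_cluster` are hypothetical, and the number of clusters is a guess you would have to tune on your own data.

```python
import numpy as np
from sklearn.cluster import KMeans


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Encode each text into one vector with a BERT-style sentence encoder.

    Assumption: sentence-transformers is installed and the model can be
    downloaded. Any other way of producing one vector per text works too.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(list(texts)))


def cluster_embeddings(embeddings, n_clusters=2, seed=0):
    """Run K-Means on the embedding matrix and return one label per text."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(np.asarray(embeddings))


def sample_per_cluster(texts, labels, k=5):
    """Collect up to k example texts per cluster for manual inspection."""
    groups = {}
    for text, label in zip(texts, labels):
        bucket = groups.setdefault(int(label), [])
        if len(bucket) < k:
            bucket.append(text)
    return groups


# Typical use (not executed here because it downloads a model):
#   embeddings = embed_texts(posts)
#   labels = cluster_embeddings(embeddings, n_clusters=10)
#   for label, examples in sample_per_cluster(posts, labels).items():
#       print(label, examples)
```

Reading a handful of examples per cluster is the manual step: if one cluster turns out to be mostly life stories, you can drop the texts carrying that label. Since the unwanted texts are only 1-5% of the data, they may not form a cluster of their own with a small `n_clusters`, so it is probably worth trying several values.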

Answered by Abhishek Verma on March 16, 2021
