Data Science Asked by zxcisnoias on March 16, 2021
I am working on an ML project whose data come from a social media platform, and the topic of the data is supposed to be depression during Covid-19. However, when I read some of the retrieved data, I noticed that some of the texts (around 1-5%) mention Covid-related keywords even though their context is not actually the pandemic: they tell a life story (from age 5 to age 27) rather than describing how Covid affects the author's life.
The data I am looking for are texts in which people describe how Covid makes their depression worse, and the like.
Is there a general way to clean out these irrelevant data (or outliers) whose context is not Covid-related?
Or is it OK to keep them in the dataset, since they only account for 1-5%?
You can use BERT to create vectors that capture the context of the whole tweet. Once you have done that, try clustering (K-Means or GMM). You can then inspect the clusters found and separate out the unwanted data.
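As a rough illustration of the clustering step, here is a minimal sketch that is not taken from any particular library. It assumes each text has already been turned into a fixed-length vector. The 2-D `embeddings` and the sample `texts` below are made-up placeholders standing in for real BERT vectors and real posts, and `kmeans` and `group_by_cluster` are hypothetical helper names written in plain Python so the example runs without extra dependencies.

```python
import math
import random


def kmeans(vectors, k, n_iter=100, seed=0):
    """Lloyd's k-means on a list of equal-length numeric vectors.

    Returns (labels, centroids), where labels[i] is the cluster of vectors[i].
    """
    rng = random.Random(seed)
    # Initialise the centroids with k input points picked at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def group_by_cluster(texts, labels):
    """Collect the texts of each cluster so they can be read by hand."""
    groups = {}
    for text, lab in zip(texts, labels):
        groups.setdefault(lab, []).append(text)
    return groups


# Placeholder posts (invented for this example).
texts = [
    "lockdown made my depression worse",
    "lost my job to covid and feel hopeless",
    "isolation during the pandemic is crushing me",
    "quarantine has wrecked my sleep and mood",
    "my life story, from age 5 to 27",
    "growing up, school, then my first job",
    "a long childhood memoir that mentions covid once",
]
# Placeholder 2-D vectors; real BERT embeddings would have hundreds of dimensions.
embeddings = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]

labels, _ = kmeans(embeddings, k=2)
groups = group_by_cluster(texts, labels)
for cluster_id, members in sorted(groups.items()):
    print(cluster_id, members)
```

On real data you would replace the placeholder vectors with the output of a BERT-style encoder and would probably use a library implementation such as scikit-learn's `KMeans` or `GaussianMixture` instead of the hand-written loop. Which cluster holds the off-topic posts still has to be decided by reading a sample from each one, and the number of clusters is something to experiment with.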
Answered by Abhishek Verma on March 16, 2021