Data Science Asked by Tuan Do on July 10, 2021
I have a bunch of documents, and I want to identify which ones talk about soccer (unsupervised learning; I do not want to label the documents manually).
One approach I am considering is to go online, search for the most popular words in soccer articles, and build a vocabulary list from them (for example: score, shoot, World Cup, etc.). I would then use that list to classify the documents (for instance, if a particular document contains 30% of the words in the list, then that document talks about soccer).
I am wondering whether this is a valid method or whether there are better existing methods. I would really appreciate any help.
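A minimal sketch of the keyword-list idea, to make the 30% rule concrete. The term list, the function names, and the sample documents are all made up for illustration; a real list would come from actual soccer articles.

```python
import re

# Hypothetical keyword list; in practice it would be collected from soccer articles.
SOCCER_TERMS = {"score", "shoot", "goal", "penalty", "striker", "world cup"}


def keyword_share(text, terms=SOCCER_TERMS):
    """Return the fraction of the terms in `terms` that occur in `text`."""
    # Lowercase and strip punctuation so that multi-word terms such as
    # "world cup" can be matched as whole words.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    padded = f" {normalized} "
    hits = sum(1 for term in terms if f" {term} " in padded)
    return hits / len(terms)


def talks_about_soccer(text, threshold=0.3):
    """Apply the 30% rule from the question (the threshold is an assumption)."""
    return keyword_share(text) >= threshold


docs = [
    "The striker took the penalty and scored a late goal in the World Cup final.",
    "The central bank raised interest rates again this quarter.",
]
labels = [talks_about_soccer(d) for d in docs]
```

Note that exact matching does not count "scored" as a hit for "score", so some form of stemming or lemmatization would probably be needed, and the threshold would have to be tuned on a few documents whose topic is known.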
First of all, you need a training set, which means manually annotating which documents are related to soccer and which are not. Then you need to process the corpus (remove numbers and stop words, apply stemming, etc.) and build a vocabulary. After that, choose an appropriate feature representation: each term is a feature, and you have to decide how to represent it, that is, what kind of weight to assign. One option is the tf-idf representation. You can then train a classifier on the resulting features.
*The only way to avoid labeling the texts manually is to find a corpus in the same language that has already been labeled.
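As a rough illustration of the preprocessing and tf-idf weighting steps described above, here is a plain-Python sketch. The stop-word list, the helper names, and the four labeled training sentences are assumptions made for the example; stemming is left out, and in practice a library implementation of tf-idf would normally be used instead.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "in", "and", "of", "to", "is"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only (drops numbers), remove stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, one {term: weight} dict per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = relative term frequency * log(N / document frequency).
        vectors.append({
            t: (c / len(tokens)) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return vocabulary, vectors


# Hypothetical manually labeled training set (1 = soccer, 0 = not soccer).
train_docs = [
    "The striker scored a goal in the match",
    "The keeper saved a penalty in the match",
    "The bank raised interest rates",
    "The market fell after the rates decision",
]
train_labels = [1, 1, 0, 0]

vocabulary, vectors = tfidf_vectors(train_docs)
```

Terms that occur in fewer documents (such as "striker") receive a higher weight than terms shared by several documents (such as "match"). The resulting vectors, together with `train_labels`, are what a classifier would then be trained on.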
Answered by Christos Karatsalos on July 10, 2021