Data Science Asked by hakiki_makato on February 1, 2021
I understand how the TF-IDF "score" is calculated for each word in a document, but I do not get how it can be used to classify a test document. For example, suppose the word "Mobile" occurs in two texts in the training data, one about Business (like the selling of mobiles) and the other about Tech. How does the "score" for the word "Mobile", in both the training documents and the test document, help the algorithm decide whether a new test document belongs to the "Business" category or the "Tech" category? I’m new to NLP, thanks in advance!
No single TF-IDF score makes classification possible on its own. Instead, the TF-IDF scores are collected into a vector that represents the whole document: for every word $w_i$ in the vocabulary, the $i$th value in the vector holds that word's TF-IDF score in the document (zero if the word does not occur in it). Using this representation for every document in a collection (the same index always corresponds to the same word), one obtains a set of vectors (instances), each containing $N$ TF-IDF scores (features), where $N$ is the size of the vocabulary.
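To make the vector representation concrete, here is a minimal pure-Python sketch. The three toy documents, the whitespace tokenizer, and the helper name `tfidf_vector` are assumptions made up for illustration, and the weighting is the plain textbook variant (term frequency times $\log(N_{docs}/df)$); libraries typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy collection (hypothetical data, not from the question).
docs = [
    "mobile sales revenue grew this quarter",   # Business-like
    "mobile processor chip benchmark results",  # Tech-like
    "the team won the football match",          # Sports-like
]

# Naive whitespace tokenization, for illustration only.
tokenized = [d.lower().split() for d in docs]

# Fixed vocabulary: position i always refers to the same word.
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Document frequency: in how many documents each word appears.
n_docs = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf_vector(tokens):
    """Return one TF-IDF score per vocabulary word (0.0 if the word is absent)."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in index:
            tf = count / len(tokens)
            idf = math.log(n_docs / df[word])
            vec[index[word]] = tf * idf
    return vec

vectors = [tfidf_vector(doc) for doc in tokenized]

# Every document becomes a vector of the same length N = len(vocab).
for doc, vec in zip(docs, vectors):
    print(doc, "->", [round(x, 3) for x in vec])
```

Note that "mobile" gets a lower score than "sales" in the first document: both occur once, but "mobile" appears in two of the three documents, so its IDF is smaller.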
Assuming we have some training data (labelled documents), we can use any supervised method to learn a model, for instance Naive Bayes, decision trees, SVM, etc. These algorithms differ from each other, but they are all able to take into account all the features of a document in order to predict a label. So in the example you give, maybe the word "mobile" only helps the algorithm eliminate the categories "sports" and "literature", while other words (or the absence of other words) help the algorithm decide between the categories "Business" and "Tech".
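As a rough illustration of how the other words end up deciding the label, the sketch below trains on four made-up documents that all contain "mobile". It uses a nearest-centroid classifier with cosine similarity, chosen only because it fits in a few lines; it is a stand-in for Naive Bayes, decision trees, or SVM, not one of them. The training texts, the labels, and the helper names (`tfidf_vector`, `cosine`, `classify`) are all assumptions for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training data; every document mentions "mobile".
train = [
    ("mobile sales revenue grew", "Business"),
    ("mobile market profit and sales", "Business"),
    ("mobile processor chip benchmark", "Tech"),
    ("new mobile chip and processor design", "Tech"),
]

tokenized = [text.lower().split() for text, _ in train]
labels = [label for _, label in train]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf_vector(tokens):
    """Plain TF-IDF vector over the training vocabulary."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in index:  # words unseen in training are ignored
            idf = math.log(len(train) / df[word])
            vec[index[word]] = (count / len(tokens)) * idf
    return vec

def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# "Training": average the TF-IDF vectors of each class into one centroid.
by_label = defaultdict(list)
for tokens, label in zip(tokenized, labels):
    by_label[label].append(tfidf_vector(tokens))
centroids = {
    label: [sum(col) / len(vecs) for col in zip(*vecs)]
    for label, vecs in by_label.items()
}

def classify(text):
    """Predict the label whose centroid is most similar to the document."""
    vec = tfidf_vector(text.lower().split())
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

print(classify("mobile processor benchmark"))  # Tech
print(classify("mobile sales profit"))         # Business
```

Because "mobile" occurs in every training document here, its IDF is $\log(4/4) = 0$, so it contributes nothing to any vector; the prediction is driven entirely by words such as "processor", "benchmark", "sales" and "profit".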
Answered by Erwan on February 1, 2021