Data Science Asked by qualia universe on November 12, 2020
When I design a document classifier using traditional feature engineering, I would prefer a tf-idf model over a Boolean model for representing a document as a vector, because intuitively the Boolean model loses information about how important each word is for classifying a document into a certain class.
What I mean is that a Boolean representation gives a document a less meaningful position in the n-dimensional vector space (where each dimension represents a term) than tf-idf-based feature extraction does, because it uses a discrete value (0 or 1) rather than a continuous one. A discrete value ignores the differences in weight between terms, although the parameter tuning process may still optimize the coefficient of each term when a linear combination is used for document classification.
Am I justified in thinking that Boolean features are not a good choice for extracting a feature vector from a document in a bag-of-words model, for the reason given above?
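The contrast described in the question can be sketched in a few lines of Python. This is only an illustration: the toy corpus, the helper names `boolean_vector` and `tfidf_vector`, and the unsmoothed `count * log(N / df)` weighting are all assumptions made for the example, and real libraries typically apply smoothing and normalization on top of this.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
docs = [
    "the match ended the match with a late goal",
    "the election ended with a narrow vote",
    "the team won the match",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))


def boolean_vector(tokens):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def tfidf_vector(tokens):
    """Raw term count times log(N / df); an unsmoothed variant (assumption)."""
    counts = Counter(tokens)
    return [counts[term] * math.log(n_docs / df[term]) for term in vocab]


bool_vec = boolean_vector(tokenized[0])
tfidf_vec = tfidf_vector(tokenized[0])

# Print only the terms that occur in the first document.
for term, b, w in zip(vocab, bool_vec, tfidf_vec):
    if b:
        print(f"{term:10s} boolean={b} tfidf={w:.3f}")
```

In the Boolean vector every term that occurs in the first document gets the same value 1, including "the", which appears in every document. Under this tf-idf variant "the" gets weight 0, while the repeated term "match" gets a larger weight than the single occurrence of "ended".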
I already know about recent approaches such as representation learning and dimensionality reduction, for example word embeddings or the BERT language model. My question is limited to traditional feature extraction from document data.
Your reasoning is correct: for most tasks related to information retrieval and/or document classification based on the semantics of the documents, it's recommended to take into account the importance of the terms (both inside the document and across all documents, hence TF and IDF).
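For reference, one common way to write the two components is given here. The symbols are assumptions introduced for this note, and many variants exist (log-scaled TF, smoothed IDF, length normalization): $\mathrm{tf}(t,d)$ is the number of occurrences of term $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

The TF factor captures importance inside the document, and the IDF factor captures importance across all documents: a term that occurs in every document has $\mathrm{idf}(t) = \log 1 = 0$. A Boolean model in the same notation simply uses the value 1 when $\mathrm{tf}(t,d) > 0$ and 0 otherwise.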
However, TF-IDF is not always the best choice:
Answered by Erwan on November 12, 2020