Data Science Asked by ac-lap on September 15, 2020
I am working on a text classification problem. The objective is to classify news articles into their corresponding categories, but in this case the categories are not broad ones like politics, sports, or economics; they are closely related and in some cases even partially overlapping. Each article belongs to exactly one category, so this is a single-label problem rather than a multi-label one. Below are the details of the method I used.
Data Preparation –
I have 4,500 categorized documents spread over 17 categories, and I used an 80:20 ratio for the training and test datasets. I used the scikit-learn Python library.
The best classification accuracy I have managed to get is 61% and I need it to be at least 85%.
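For reference, here is a minimal sketch of the kind of pipeline I am describing; the vectorizer and classifier settings shown are illustrative placeholders rather than my exact configuration, and `texts`/`labels` stand in for my 4,500 articles and 17 categories.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# texts: list of article strings, labels: list of category names (placeholders)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

pipeline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2), min_df=2),
    LinearSVC(C=1.0),
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))  # mean accuracy on the 20% test split
```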
Any help on how I can improve the accuracy would be greatly appreciated. Thanks a lot. Please let me know if you need any more details.
First of all, good job processing the data and coming up with your base model. I would suggest a few things that you can try:
Hope this helps and let me know if this improves your classification accuracy.
Answered by Santanu_Pattanayak on September 15, 2020
When considering how to clean the text, we should think about the data problem we are trying to solve. Here are a few more preprocessing steps that can improve your features (a short sketch of these steps follows the list):
1.) Use a good tokenizer (TextBlob, Stanford tokenizer).
2.) Try lemmatization; stemming does not always perform well on news articles.
3.) Word segmentation.
4.) Normalization (equivalence classing of terms).
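A minimal sketch of steps 1, 2, and 4, assuming NLTK is available (TextBlob or the Stanford tokenizer could be substituted):

```python
import re
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads for the tokenizer and lemmatizer data
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                       # normalization: case folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation/symbols
    tokens = nltk.word_tokenize(text)         # tokenization
    return [lemmatizer.lemmatize(t) for t in tokens]  # lemmatization, not stemming

print(preprocess("Markets rallied as the central banks' policies eased."))
```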
For selecting a model (a sketch of point 2 follows the list):
1.) In your example above, documents were classified by comparing the number of matching terms in the document vectors. In practice, more powerful classification algorithms exist, such as Support Vector Machines (SVMs), Naive Bayes, Decision Trees, and Maximum Entropy.
2.) You can also frame the problem as clustering the news articles and reading semantic relationships off those clusters. Try topic modelling (LDA and LSA) or Doc2Vec/Word2Vec to obtain a vector for each document/word, and then use these vectors for the classification task.
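A hedged sketch of point 2 using gensim's Doc2Vec for document vectors and scikit-learn for the classifier; `train_texts`, `test_texts`, `y_train`, and `y_test` are placeholders for your own split:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Tag each training document with its index so its vector can be looked up later
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(train_texts)]
d2v = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=40)

X_train = [d2v.dv[i] for i in range(len(train_texts))]             # learned vectors
X_test = [d2v.infer_vector(text.split()) for text in test_texts]   # inferred vectors

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```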
Further, if you are unsure which model to choose for your problem, you can read this link: Choosing Machine Learning Algorithms: Lessons from Microsoft Azure
Answered by Abhishek Verma on September 15, 2020
Another important step is normalizing your document/term matrix. If, for example, you have many long documents and many short documents, the long documents receive higher weight in the document/term matrix. If you normalize each document row so that it sums to 1.0, short and long documents get more equal weighting. People have also devised several non-linear weighting functions for document/term matrices that address various problems with the raw counts. A minimal sketch of row normalization follows.
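A small, self-contained example of row-normalizing a term-count matrix with scikit-learn so each document row sums to 1.0:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["short piece",
        "a much much longer news article about markets markets markets"]
counts = CountVectorizer().fit_transform(docs)      # raw term counts (sparse)
row_normed = normalize(counts, norm="l1", axis=1)   # each row now sums to 1.0
print(row_normed.toarray().sum(axis=1))             # -> [1. 1.]
```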
Here is a long series of posts I once wrote on this topic: https://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html
Answered by Jack Parsons on September 15, 2020
You should try FastText, an open-source library from Facebook Research: https://fasttext.cc/docs/en/supervised-tutorial.html
You need to put your data into the file format the FastText algorithm expects (a sketch follows).
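A sketch of the expected format and a basic supervised training call, using the official `fasttext` Python package; the file name and the `train_texts`/`train_labels` variables are placeholders:

```python
import fasttext

# Each line: "__label__<category>" followed by the (cleaned) article text, e.g.
#   __label__politics the senate passed the bill on tuesday
with open("news.train", "w") as f:
    for text, label in zip(train_texts, train_labels):  # placeholder variables
        f.write(f"__label__{label} {text}\n")

model = fasttext.train_supervised(input="news.train", epoch=25, wordNgrams=2)
print(model.predict("the central bank raised interest rates"))
```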
Also follow the text-cleaning suggestions given above. FastText automatically builds word n-grams, so you don't need to engineer them yourself. I have gotten very good results with FastText.
Try the autotune option in supervised learning (sketch below).
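A hedged example of autotune: FastText searches hyperparameters against a held-out validation file (file names here are placeholders):

```python
import fasttext

model = fasttext.train_supervised(
    input="news.train",
    autotuneValidationFile="news.valid",  # file the search optimizes against
    autotuneDuration=600,                 # seconds to spend searching
)
print(model.test("news.valid"))           # (N, precision@1, recall@1)
```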
Good luck
Answered by Karthik Sunil on September 15, 2020