
Improving accuracy of Text Classification

Data Science Asked by ac-lap on September 15, 2020

I am working on a text classification problem. The objective is to classify news articles into their corresponding categories, but in this case the categories are not broad ones like politics, sports, or economics; they are very closely related and in some cases even partially overlapping. Each document belongs to exactly one category (single-label classification, not multi-label). Below are the details of the method I used.

Data Preparation –

  1. Broke the documents into lists of words.
  2. Removed stop words and punctuation.
  3. Performed stemming.
  4. Replaced numerical values with ‘#num#’ to reduce vocabulary size.
  5. Transformed the documents into TF-IDF vectors.
  6. Sorted all words by their TF-IDF value and selected the top 20K words; these were used as the feature list for the classification algorithm.
  7. Used SVM.

I have 4,500 categorized documents across 17 categories, and I used an 80:20 ratio for the training and test datasets. I used the scikit-learn Python library.
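
For reference, a minimal sketch of this pipeline (TF-IDF vectors, a 20K-term vocabulary, SVM) assuming scikit-learn; `texts` and `labels` are placeholders for the preprocessed documents and their categories:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# texts: list of preprocessed article strings; labels: their 17 categories
# max_features limits the vocabulary to the 20K most frequent terms
# (an approximation of the top-TF-IDF selection described above)
vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(texts)

# 80:20 split between training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

clf = LinearSVC()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```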

The best classification accuracy I have managed to get is 61% and I need it to be at least 85%.

Any help on how I can improve the accuracy would be greatly appreciated. Thanks a lot. Please let me know if you need any more details.

4 Answers

First of all, good job processing the data and coming up with your base model. I would suggest a few things that you can try:

  1. Improve your model by adding bigrams and trigrams as features.
  2. Try topic modelling such as latent Dirichlet allocation (LDA) or probabilistic latent semantic analysis (pLSA) on the corpus with a specified number of topics, say 20. You would get a vector of 20 topic probabilities for each document, which you could use as the input to your classifier or as additional features on top of what you already have from your base model enhanced with bigrams and trigrams (see the sketch after this list).
  3. Try a tree-based classifier ensemble to capture non-linearity and interactions between the features. Either random forest or gradient boosting would be fine; for gradient boosting, xgboost is a good package that gives good classification results.
  4. If you are familiar with deep learning, you can try a recurrent neural network architecture (usually an LSTM variant).
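
A minimal sketch of suggestions 1 and 2, assuming scikit-learn; the 20 LDA topic probabilities are stacked next to the n-gram TF-IDF features, and variable names are illustrative:

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# docs: the preprocessed training documents (assumed to exist)

# 1. Uni-, bi- and tri-grams as TF-IDF features
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=20000)
X_tfidf = tfidf.fit_transform(docs)

# 2. 20 LDA topic probabilities per document as additional features
counts = CountVectorizer(max_features=20000).fit_transform(docs)
lda = LatentDirichletAllocation(n_components=20, random_state=0)
X_topics = lda.fit_transform(counts)                 # shape (n_docs, 20)

# Combined feature matrix for an SVM, random forest or gradient-boosted trees
X = sp.hstack([X_tfidf, sp.csr_matrix(X_topics)])
```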

Hope this helps and let me know if this improves your classification accuracy.

Answered by Santanu_Pattanayak on September 15, 2020

When considering how to clean the text, we should think about the data problem we are trying to solve. Here are a few more preprocessing steps that can improve your features.

1.) Use a good tokenizer (TextBlob, Stanford tokenizer).

2.) Try lemmatization; stemming does not always perform well on news articles (see the sketch after this list).

3.) Word segmentation.

4.) Normalization (equivalence classing of terms).
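
A minimal sketch of points 1 and 2 using NLTK (one possible choice; TextBlob or spaCy work just as well):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")     # tokenizer models
nltk.download("wordnet")   # lemmatizer dictionary

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # tokenization
    return [lemmatizer.lemmatize(tok) for tok in tokens]  # lemmatization instead of stemming

print(preprocess("Markets rallied after the elections"))
# 'markets' -> 'market', 'elections' -> 'election'
```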

For selecting a model:

1.) In your example above, documents are classified by comparing term-based document vectors. In the real world, numerous more complex classification algorithms exist, such as Support Vector Machines (SVMs), Naive Bayes, Decision Trees and Maximum Entropy.

2.) You can think of your problem as clustering the news articles and deriving semantic relationships among them from these clusters. You can try topic modelling (LDA and LSA) and Doc2vec/word2vec techniques to get a vector for each document/word, and then use these vectors for the classification task.
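
A minimal sketch of the Doc2vec idea, assuming gensim; `tokenized_docs` is a placeholder for your tokenized articles:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_docs: list of token lists, one per article (assumed to exist)
tagged = [TaggedDocument(words=toks, tags=[str(i)])
          for i, toks in enumerate(tokenized_docs)]
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=20)

# One 100-dimensional vector per document, usable as input to any classifier
doc_vectors = [model.infer_vector(toks) for toks in tokenized_docs]
```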

Further, if you are unsure how to select an appropriate model for the problem, you can read this link: Choosing Machine Learning Algorithms: Lessons from Microsoft Azure

Answered by Abhishek Verma on September 15, 2020

Another important tool is normalizing your document/term matrix. If, for example, you have many long documents and many short documents, the long documents are weighted higher in the document/term matrix. If you normalize each document row so that it sums to 1.0, short and long documents receive more equal weighting. People have created several non-linear functions for document/term matrices that address various problems with the base data.
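
A minimal sketch of that row normalization with scikit-learn, assuming `X` is your document/term matrix:

```python
from sklearn.preprocessing import normalize

# L1-normalize each document row so it sums to 1.0,
# giving short and long documents comparable weight
X_norm = normalize(X, norm="l1", axis=1)
```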

Here is a long series of posts I once wrote on this topic: https://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html

Answered by Jack Parsons on September 15, 2020

You should try fastText, an open-source library by Facebook Research: https://fasttext.cc/docs/en/supervised-tutorial.html

You need to put your data into the file format that the fastText algorithm expects.

Also, a few suggestions for cleaning the text:

  1. Convert the text to lower case
  2. Remove hyperlinks
  3. Try to remove misspelled words

fastText automatically converts words into n-grams, so you don't need to worry about that. I have gotten very good results with fastText.

Try the autotune option in supervised learning.
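
A minimal sketch of the supervised fastText workflow with autotune, assuming the fasttext Python package; the file names are illustrative:

```python
import fasttext

# Each line of train.txt / valid.txt looks like: "__label__<category> <article text>"
# e.g.  __label__economy Central bank raises interest rates ...
model = fasttext.train_supervised(
    input="train.txt",
    autotuneValidationFile="valid.txt",  # let fastText tune hyperparameters
    autotuneDuration=600,                # search budget in seconds
)

print(model.predict("text of a new article to classify"))
```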

Good luck

Answered by Karthik Sunil on September 15, 2020
