Data Science Asked by Lothbrok on May 25, 2021
I’m planning to write a classification program that can classify unknown text into around 10 different categories, and if none of them fits it would be nice to know that. It is also possible that more than one category is correct.
My predefined categories are:
c1 = "politics"
c2 = "biology"
c3 = "food"
...
I’m thinking about the right way to represent my training data and which kind of classification is appropriate. The first challenge is finding the right features. If all I have is text (around 250 words each), what method would you recommend for finding the right features? My first approach is to remove all stop words and use a POS tagger (the Stanford NLP POS tagger) to find nouns, adjectives, etc. I count them and use all frequently appearing words as features.
E.g., for politics I have around 2,000 text entities. With the mentioned POS tagger I found:
law: 841
capitalism: 412
president: 397
democracy: 1007
executive: 112
...
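In code, my current approach looks roughly like this (a sketch in Python using NLTK as a stand-in for the Stanford tagger; texts is placeholder data standing in for my ~2,000 politics documents):

# Sketch of the described preprocessing: stop-word removal, POS tagging,
# and counting frequent nouns/adjectives. NLTK stands in for the
# Stanford POS tagger here; `texts` is placeholder data.
from collections import Counter
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")

texts = ["The president signed a new law.", "Democracy needs free elections."]

stop_words = set(stopwords.words("english"))
counts = Counter()
for text in texts:
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith(("NN", "JJ")):  # nouns and adjectives
            counts[word] += 1

features = [w for w, _ in counts.most_common(50)]  # frequent words as features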
Would it be right to use only those as features? The training set would then look like:
Training set for politics:
feature law numeric
feature capitalism numeric
feature president numeric
feature democracy numeric
feature executive numeric
class politics,all_others
sample data:
politics,5,7,1,9,3
politics,14,4,6,7,9
politics,9,9,9,4,2
politics,5,8,0,7,6
...
all_others,0,2,4,1,0
all_others,0,0,1,1,1
all_others,7,4,0,0,0
...
Would this be the right approach for binary classification? Or how would I define my sets? Or is multi-class classification the right approach? Then it would look like:
Training set for all classes:
feature law numeric
feature capitalism numeric
feature president numeric
feature democracy numeric
feature executive numeric
feature genetics numeric
feature muscle numeric
feature blood numeric
feature burger numeric
feature salad numeric
feature cooking numeric
class politics,biology,food
sample data:
politics,5,7,1,9,3,0,0,2,1,0,1
politics,14,4,6,7,9,0,0,0,0,0,1
politics,9,9,9,4,2,1,1,1,1,0,3
politics,5,8,0,7,6,2,2,0,1,0,1
...
biology,0,2,4,1,0,4,19,5,0,2,2
biology,0,0,1,1,1,12,9,9,2,1,1
biology,7,4,0,0,0,10,10,3,0,0,7
...
What would you say?
I think the first thing to decide, which will help clarify some of your other questions, is whether you want to perform binary classification or multi-class classification. If you're interested in assigning each instance in your dataset to more than one class, that brings up a new set of concerns about how you set up your dataset, which experiments you want to run, and how you plan to evaluate your classifier(s). My hunch is that you could formulate your task as a binary one, where you train and test one classifier for each class you want to predict and set up the data matrix so that there are two classes to predict: (1) the one you're interested in detecting and (2) everything else.
In that case, instead of your training set looking like this (where each row is a document, columns 1-3 contain that document's features, and the class column holds the class to be predicted):
1 2 3 class
feature1 feature2 feature3 politics
feature1 feature2 feature3 biology
feature1 feature2 feature3 food
feature1 feature2 feature3 politics
it would look like the following in the case where you're interested in detecting the politics class against everything else:
1 2 3 class
feature1 feature2 feature3 politics
feature1 feature2 feature3 non-politics
feature1 feature2 feature3 non-politics
feature1 feature2 feature3 politics
You would need to repeat this process for each class you're interested in predicting, then train and test one classifier per class and evaluate each according to your chosen metrics (usually accuracy, precision, recall, or some variation thereof).
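For example, here is a rough sketch of that one-classifier-per-class setup with scikit-learn (the documents and labels are hypothetical placeholders; your own count features would work just as well, and in practice you would evaluate on a held-out test set rather than the training data):

# One binary classifier per target class: relabel the data as
# "class vs. everything else", train, and report precision/recall.
# Placeholder data; evaluate on a held-out split in practice.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

docs = ["president signed a law", "genetics of blood cells",
        "burger and salad recipe", "democracy and capitalism"]
labels = ["politics", "biology", "food", "politics"]

X = CountVectorizer(stop_words="english").fit_transform(docs)
for target in ["politics", "biology", "food"]:
    y = [1 if lab == target else 0 for lab in labels]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    pred = clf.predict(X)
    print(target, precision_score(y, pred, zero_division=0),
          recall_score(y, pred, zero_division=0))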
As far as choosing features goes, this requires quite a bit of thought. Features can be highly dependent on the type of text you're trying to classify, so be sure to explore your dataset and get a sense of how people write in each domain. Qualitative investigation isn't enough to decide once and for all which features are good, but it is a good way to generate ideas. Also, look into TF-IDF weighting of terms instead of just using their raw frequency within each instance of your dataset. This will help you pick up on (a) terms that are prevalent within a document (and possibly a target class) and (b) terms that distinguish a given document from other documents. I hope this helps a little.
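Scikit-learn's TfidfVectorizer is one ready-made way to get that weighting; you could drop it in place of the CountVectorizer in the sketch above (again with placeholder documents):

# TF-IDF: term frequency within a document, discounted by how common
# the term is across all documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["president signed a law", "genetics of blood cells"]
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_tfidf = vectorizer.fit_transform(docs)  # rows: documents, columns: weighted terms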
Correct answer by kylerthecreator on May 25, 2021
The following great article by Sebastian Raschka on the Bayesian approach to text classification should be very helpful for your task. I also highly recommend his excellent blog on data science topics as an additional general reference.
You may also check this educational report on text classification. It might provide you with some additional ideas.
Answered by Aleksandr Blekh on May 25, 2021
You should probably start with a very basic approach: a bag-of-words representation (a vector as long as your vocabulary, with 1 if the word appears in the text and 0 if it doesn't) and a simple classifier like naive Bayes. This works surprisingly well for finding topics (a little less well for sentiment classification). For preprocessing you would probably want stop-word removal and stemming (to reduce the vocabulary) rather than POS tagging.
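A minimal version of that baseline, sketched with scikit-learn (BernoulliNB matches the 1/0 word-presence representation; the training data here is a hypothetical placeholder):

# Binary bag-of-words + naive Bayes baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

train_docs = ["president signed a law", "genetics of blood cells",
              "burger and salad recipe"]
train_labels = ["politics", "biology", "food"]

model = make_pipeline(
    CountVectorizer(binary=True, stop_words="english"),  # 1/0 word presence
    BernoulliNB(),
)
model.fit(train_docs, train_labels)
print(model.predict(["the president proposed a new law"]))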
The problem with the basic approach is that you get an n-class classifier, with no way to answer "this fits multiple categories" or "this fits zero categories". If you want to cover those cases, the best option is to build n two-class classifiers, one per class, where each classifier decides whether the text fits its class or not, as sketched below.
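Sketched the same way, with one yes/no classifier per class and a probability threshold, a text can then match several categories or none (the 0.5 threshold is an arbitrary assumption you would tune, and the data is again a placeholder):

# n two-class classifiers: accept every class whose probability clears
# the threshold, so a text may get several labels or none.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

train_docs = ["president signed a law", "genetics of blood cells",
              "burger and salad recipe"]
train_labels = ["politics", "biology", "food"]

vec = CountVectorizer(binary=True)
X = vec.fit_transform(train_docs)

classifiers = {}
for target in ["politics", "biology", "food"]:
    y = [1 if lab == target else 0 for lab in train_labels]
    classifiers[target] = BernoulliNB().fit(X, y)

x_new = vec.transform(["a burger recipe with salad"])
matches = [c for c, clf in classifiers.items()
           if clf.predict_proba(x_new)[0, 1] > 0.5]  # may be empty or several
print(matches)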
But I would try out-of-the-box naive Bayes first, just to see how it works. You can use Weka; it's free, open source, and can be integrated with Java. You can also do the preprocessing (stemming) with Python's NLTK.
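For the stemming step, NLTK's Porter stemmer is enough to get started:

# Stemming with NLTK to shrink the vocabulary before bag-of-words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["cooking", "cooked", "cooks"]])  # all -> "cook"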
Answered by a. d. on May 25, 2021