Data Science Asked by Lothbrok on May 25, 2021
I’m planning to write a classification program that can classify unknown text into around 10 different categories, and if none of them fits it would be nice to know that. It is also possible that more than one category is correct.
My predefined categories are:
c1 = "politics"
c2 = "biology"
c3 = "food"
...
I’m thinking about the right way to represent my training data and which kind of classification is appropriate. The first challenge is finding the right features. If all I have is text (around 250 words each), what method would you recommend for finding the right features? My first approach is to remove all stop words and use a POS tagger (the Stanford NLP POS tagger) to find nouns, adjectives, etc. I count them and use all frequently appearing words as features.
E.g., for politics I have around 2,000 text entities. With the mentioned POS tagger I found:
law: 841
capitalism: 412
president: 397
democracy: 1007
executive: 112
...
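In code, my current approach looks roughly like this (a sketch in Python using NLTK as a stand-in for the Stanford tagger; texts is placeholder data standing in for my ~2,000 politics documents):

# Sketch of the described preprocessing: stop-word removal, POS tagging,
# and counting frequent nouns/adjectives. NLTK stands in for the
# Stanford POS tagger here; `texts` is placeholder data.
from collections import Counter
import nltk
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")

texts = ["The president signed a new law.", "Democracy needs free elections."]

stop_words = set(stopwords.words("english"))
counts = Counter()
for text in texts:
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith(("NN", "JJ")):  # nouns and adjectives
            counts[word] += 1

features = [w for w, _ in counts.most_common(50)]  # frequent words as features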
Would it be right to use only those as features? The training set would then look like:
Training set for politics:
feature law numeric
feature capitalism numeric
feature president numeric
feature democracy numeric
feature executive numeric
class politics,all_others
sample data:
politics,5,7,1,9,3
politics,14,4,6,7,9
politics,9,9,9,4,2
politics,5,8,0,7,6
...
all_others,0,2,4,1,0
all_others,0,0,1,1,1
all_others,7,4,0,0,0
...
Would this be the right approach for binary classification? Or how would I define my sets? Or is multi-class classification the right approach? Then it would look like:
Training set for all classes:
feature law numeric
feature capitalism numeric
feature president numeric
feature democracy numeric
feature executive numeric
feature genetics numeric
feature muscle numeric
feature blood numeric
feature burger numeric
feature salad numeric
feature cooking numeric
class politics,biology,food
sample data:
politics,5,7,1,9,3,0,0,2,1,0,1
politics,14,4,6,7,9,0,0,0,0,0,1
politics,9,9,9,4,2,1,1,1,1,0,3
politics,5,8,0,7,6,2,2,0,1,0,1
...
biology,0,2,4,1,0,4,19,5,0,2,2
biology,0,0,1,1,1,12,9,9,2,1,1
biology,7,4,0,0,0,10,10,3,0,0,7
...
What would you say?
I think the first thing to decide, which will help clarify some of your other questions, is whether you want to perform binary classification or multi-class classification. If you're interested in assigning each instance in your dataset to more than one class, that brings up a new set of concerns about how you set up your dataset, which experiments you want to run, and how you plan to evaluate your classifier(s). My hunch is that you could formulate your task as a binary one, where you train and test one classifier for each class you want to predict and set up the data matrix so that there are two classes to predict: (1) the one you're interested in detecting and (2) everything else.
In that case, instead of your training set looking like this (where each row is a document, columns 1-3 contain that document's features, and the class column holds the class to be predicted):
1 2 3 class
feature1 feature2 feature3 politics
feature1 feature2 feature3 biology
feature1 feature2 feature3 food
feature1 feature2 feature3 politics
it would look like the following in the case where you're interested in detecting the politics class against everything else:
1 2 3 class
feature1 feature2 feature3 politics
feature1 feature2 feature3 non-politics
feature1 feature2 feature3 non-politics
feature1 feature2 feature3 politics
You would need to repeat this process for each class you're interested in predicting, then train and test one classifier per class and evaluate each according to your chosen metrics (usually accuracy, precision, recall, or some variation thereof).
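For example, here is a rough sketch of that one-classifier-per-class setup with scikit-learn (the documents and labels are hypothetical placeholders; your own count features would work just as well, and in practice you would evaluate on a held-out test set rather than the training data):

# One binary classifier per target class: relabel the data as
# "class vs. everything else", train, and report precision/recall.
# Placeholder data; evaluate on a held-out split in practice.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

docs = ["president signed a law", "genetics of blood cells",
        "burger and salad recipe", "democracy and capitalism"]
labels = ["politics", "biology", "food", "politics"]

X = CountVectorizer(stop_words="english").fit_transform(docs)
for target in ["politics", "biology", "food"]:
    y = [1 if lab == target else 0 for lab in labels]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    pred = clf.predict(X)
    print(target, precision_score(y, pred, zero_division=0),
          recall_score(y, pred, zero_division=0))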
As far as choosing features goes, this requires quite a bit of thought. Features can be highly dependent on the type of text you're trying to classify, so be sure to explore your dataset and get a sense of how people write in each domain. Qualitative investigation isn't enough to decide once and for all which features are good, but it is a good way to generate ideas. Also, look into TF-IDF weighting of terms instead of just using their raw frequency within each instance of your dataset. This will help you pick up on (a) terms that are prevalent within a document (and possibly a target class) and (b) terms that distinguish a given document from other documents. I hope this helps a little.
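Scikit-learn's TfidfVectorizer is one ready-made way to get that weighting; you could drop it in place of the CountVectorizer in the sketch above (again with placeholder documents):

# TF-IDF: term frequency within a document, discounted by how common
# the term is across all documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["president signed a law", "genetics of blood cells"]
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_tfidf = vectorizer.fit_transform(docs)  # rows: documents, columns: weighted terms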
Correct answer by kylerthecreator on May 25, 2021
The following great article by Sebastian Raschka on the Bayesian approach to text classification should be very helpful for your task. I also highly recommend his excellent blog on data science topics as an additional general reference.
You may also check this educational report on text classification. It might provide you with some additional ideas.
Answered by Aleksandr Blekh on May 25, 2021
You should probably start with a very basic approach: a bag-of-words representation (a vector as long as your vocabulary, with 1 if the word appears in the text and 0 if it doesn't) and a simple classifier like naive Bayes. This works surprisingly well for finding topics (a little less well for sentiment classification). For preprocessing you would probably want stop-word removal and stemming (to reduce the vocabulary) rather than POS tagging.
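A minimal version of that baseline, sketched with scikit-learn (BernoulliNB matches the 1/0 word-presence representation; the training data here is a hypothetical placeholder):

# Binary bag-of-words + naive Bayes baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

train_docs = ["president signed a law", "genetics of blood cells",
              "burger and salad recipe"]
train_labels = ["politics", "biology", "food"]

model = make_pipeline(
    CountVectorizer(binary=True, stop_words="english"),  # 1/0 word presence
    BernoulliNB(),
)
model.fit(train_docs, train_labels)
print(model.predict(["the president proposed a new law"]))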
The problem with the basic approach is that you get an n-class classifier, with no way to answer "this fits multiple categories" or "this fits zero categories". If you want to cover those cases, the best option is to build n two-class classifiers, one per class, where each classifier decides whether the text fits its class or not, as sketched below.
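Sketched the same way, with one yes/no classifier per class and a probability threshold, a text can then match several categories or none (the 0.5 threshold is an arbitrary assumption you would tune, and the data is again a placeholder):

# n two-class classifiers: accept every class whose probability clears
# the threshold, so a text may get several labels or none.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

train_docs = ["president signed a law", "genetics of blood cells",
              "burger and salad recipe"]
train_labels = ["politics", "biology", "food"]

vec = CountVectorizer(binary=True)
X = vec.fit_transform(train_docs)

classifiers = {}
for target in ["politics", "biology", "food"]:
    y = [1 if lab == target else 0 for lab in train_labels]
    classifiers[target] = BernoulliNB().fit(X, y)

x_new = vec.transform(["a burger recipe with salad"])
matches = [c for c, clf in classifiers.items()
           if clf.predict_proba(x_new)[0, 1] > 0.5]  # may be empty or several
print(matches)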
But I would try out-of-the-box naive Bayes first, just to see how it works. You can use Weka; it's free, open source, and can be integrated with Java. You can also do the preprocessing (stemming) with Python's NLTK.
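For the stemming step, NLTK's Porter stemmer is enough to get started:

# Stemming with NLTK to shrink the vocabulary before bag-of-words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["cooking", "cooked", "cooks"]])  # all -> "cook"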
Answered by a. d. on May 25, 2021