Data Science Asked by user79322 on October 15, 2020
I am working on a problem where I need to predict the text in one column of my training data from the text in another column. For example, if one column holds the value Software and the corresponding column holds the value adobe pdf, my algorithm should learn that mapping and apply it to my test data as well: if my test data contains Tableau, the predicted category should be Software. There are primary categories such as Software, Tax, etc., and each of them has sub-categories. My job is to predict the right sub-category against the primary category in my test data.
What should I do to increase my accuracy? Does the n-gram range (ngram_range) affect the accuracy a lot? I could see some effect, but not much.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
# Prepare the final pipeline using the selected parameters
model = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced"))),
])
I have already used SVM and LinearSVC for the classification, but my accuracy is only 78%. My training data set is in German and contains no stop words, since the entries are category names rather than long running text.
A generally good approach for high-dimensional text data is to use word embeddings and neural networks. If you want to stick with the models and approach you are using, I suggest you don't over-engineer your dataset: first try a simple bag-of-words approach and, while cross-validating, try different n-gram ranges and POS tagging. That said, I suspect the very high-dimensional feature space is what is bringing your accuracy score down.
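As a rough illustration of the cross-validation suggestion, here is a minimal sketch that compares a few n-gram ranges on the same pipeline as in the question and reports how many features each range produces. The `texts` and `labels` values are made-up toy data standing in for the real German category strings, and the candidate ranges and `cv=3` are arbitrary choices, so the scores themselves mean nothing here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; replace with the real text and category columns.
texts = [
    "adobe pdf", "tableau desktop", "microsoft office", "sap lizenz",
    "adobe photoshop", "office 365 lizenz",
    "einkommensteuer", "umsatzsteuer voranmeldung", "gewerbesteuer",
    "lohnsteuer anmeldung", "steuer erklaerung", "umsatzsteuer erklaerung",
]
labels = ["software"] * 6 + ["tax"] * 6

# Same structure as the pipeline in the question.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", OneVsRestClassifier(LinearSVC(class_weight="balanced"))),
])

# Candidate n-gram ranges to compare (an assumption, not a recommendation).
ngram_ranges = [(1, 1), (1, 2), (1, 4)]
param_grid = {"vectorizer__ngram_range": ngram_ranges}

# Cross-validated accuracy for each candidate range.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

# Feature-space size per range, to see how fast the dimensionality grows.
feature_counts = {
    r: len(CountVectorizer(ngram_range=r).fit(texts).vocabulary_)
    for r in ngram_ranges
}

results = zip(search.cv_results_["params"], search.cv_results_["mean_test_score"])
for params, score in results:
    r = params["vectorizer__ngram_range"]
    print(f"ngram_range={r}: {feature_counts[r]} features, mean CV accuracy={score:.2f}")

print("Best range:", search.best_params_["vectorizer__ngram_range"])
```

If the wider ranges add many features without improving the cross-validated score, that would be consistent with the suspicion that the extra dimensions are not helping on short category strings.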
Answered by Iordanis on October 15, 2020