Data Science Asked by Justin Cunningham on February 27, 2021
I am trying to train a sklearn K-NN classifier on a labeled text dataset (in Irish). There are 5 classes, 0-4, but the class sizes vary enormously.
What I have done is take a corpus of Irish text, iterate through every word, and strip a few letters from each based on the linguistic form it took (or didn't). The problem is that class 4 (which means no action was performed) accounts for 16.5M of the 20.1M entries, while the smallest class, class 3, has only 36,000 entries.
Gathering more data probably won’t help as this basically represents the proportion of times these forms of words appear in real life.
Is this bad for classification and will it bias the classifier in any way? If it does, is that bias actually of help?
Any help is appreciated.
Justin
I can think of 2 solutions:
1. Since you mention stripping letters from words, why not make it a two-step program: the first classifier is binary, with classes 0-3 grouped into one "action performed" class and class 4 as the "no action performed" class. If a word falls into the first category, you then run it through a second classifier to decide between classes 0-3.
2. Downsample class 4 to match the distribution of the other classes. This will discard a huge amount of data, which I don't think is viable, but it is worth a try!
Bias is never good for a model, and that is clearly explained by Shiv!
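The first suggestion above can be sketched roughly as follows. The feature matrix and class proportions here are synthetic stand-ins (in practice X would be vectorised word features, e.g. character n-grams), and the function name predict is my own; this is a sketch of the two-stage idea, not a tuned implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the real data: 10 numeric features per word,
# with class 4 ("no action") heavily dominant, as in the question.
X = rng.normal(size=(1000, 10))
y = rng.choice([0, 1, 2, 3, 4], size=1000, p=[0.05, 0.05, 0.05, 0.05, 0.8])

# Stage 1: binary "no action (class 4)" vs. "some action (classes 0-3)".
y_binary = (y == 4).astype(int)
stage1 = KNeighborsClassifier(n_neighbors=5).fit(X, y_binary)

# Stage 2: trained only on the minority rows, so class 4 cannot
# drown out classes 0-3 in the neighbourhood votes.
mask = y != 4
stage2 = KNeighborsClassifier(n_neighbors=5).fit(X[mask], y[mask])

def predict(X_new):
    pred = np.full(len(X_new), 4)            # default: no action
    is_action = stage1.predict(X_new) == 0   # stage 1 says "not class 4"
    if is_action.any():
        pred[is_action] = stage2.predict(X_new[is_action])
    return pred

print(predict(X[:5]))
```

The point of the split is that the second model never sees class 4 at training time, so its k-nearest-neighbour votes are decided only among the four minority classes.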
Correct answer by Soumya Kundu on February 27, 2021
Just imagine it practically: if class A makes up 90% of the data and class B only 10%, then a classifier that simply predicts class A for everything will achieve 90% accuracy.
So imbalanced data will bias your model toward the class with more data, because doing so makes its predictions look better.
Answered by Shiv on February 27, 2021