Data Science Asked on September 5, 2021
Consider the multilabel problem when asking "does the sample belong to this class" with, for example, a movie label dataset where almost every movie is labelled "drama" because of this label’s vagueness. It also has a few rare labels such as "French late 60s noir".
To both boost recall, as well as maintain compatibility with most classifiers, we can both resample and train a binary classifier for each label individually (drama: yes/no, noir: yes/no).
But while "French late 60s noir" might be a 0.05:0.95 imbalance that requires serious effort for the model to guess, the "drama" category is so widely applied that the problem is reversed, i.e. 0.95:0.05.
How should one logically best proceed to maintain recall over precision, now that the "target class" is 95% of the labels (kind of breaks with the definition)? Upsample the non-target class? Downsample the target class? Reframe the problem as "not drama" to maintain underbalance for every label?
I don't think there's any obviously best option. I'd suggest trying a few reasonable solutions, evaluate on a development set and then pick the one which performs best. Don't forget to try with the original data as it is, resampling doesn't always work better.
Upsample the non-target class? Downsample the target class?
These two options are very likely to give the same result (assuming you use the same proportion).
Reframe the problem as "not drama" to maintain underbalance for every label?
For the classifier it would be exactly the same.
Answered by Erwan on September 5, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP