Cross Validated Asked by Aishwarya A R on January 2, 2021
I am working on a multi-label classification problem. Each sample can take more than one label, and some samples have no labels at all.
In my dataset, 50% of the samples have one or more labels associated with them; the rest have none. I am sure that among future "test" samples there will also be a population with no labels attached.
So far, I've been dropping the 50% of samples with no labels and training a multi-label classifier. Recently, I realized that this model will end up predicting labels for a sample even when none of the labels are appropriate for it. This leaves me with 2 options –
Am I thinking in the right direction? I’d also like to know your suggestions on this problem.
Let $n$ be the number of distinct labels. The problem with your first proposed solution is that your multi-label method now has to learn that the label "NONE" never co-occurs with any other label. If the multi-label method assumes nothing about the distribution of labels, it has to learn that none of the $2^n - 1$ label combinations in which "NONE" $= 1$ and at least one of the other $n$ labels is also $1$ ever occurs (for $n = 10$, that is already $2^{10} - 1 = 1023$ impossible combinations). It also does not prevent the model from predicting all zeroes, i.e. "NONE" $= 0$ with every other label $= 0$ as well.
As your problem has a lot of samples with no labels at all, a simple and effective solution is to build your own hierarchical classifier out of two models. The first is a binary classifier that just detects whether all labels are zero. To train it, map every sample with no labels to class "A" and every sample with at least one label to class "B"; an "A" prediction means no labels at all, and a "B" means at least one label exists. The second model is any multi-label classifier you want, but trained only on the samples with at least one label. In the prediction/test phase, this second classifier is only called if the first binary classifier predicts "B" (at least one label). Details on more elaborate hierarchical classifiers can be found in: https://www.researchgate.net/publication/306040749_Consistency_of_Probabilistic_Classifier_Trees
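A minimal sketch of this two-stage idea with scikit-learn is below. The estimator choice (LogisticRegression), the function names, and the Binary Relevance wrapper for stage 2 are illustrative assumptions on my part, not something the approach prescribes; any gate classifier and any multi-label classifier would slot in the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

def fit_hierarchical(X, Y):
    # X: (n_samples, n_features); Y: (n_samples, n_labels) binary indicator matrix.
    has_label = Y.any(axis=1)                        # True = class "B" (>= 1 label)
    gate = LogisticRegression(max_iter=1000).fit(X, has_label)
    # Stage 2: multi-label model (here Binary Relevance), trained only on "B"
    # samples. Assumes every label appears at least once among those samples.
    multi = MultiOutputClassifier(LogisticRegression(max_iter=1000))
    multi.fit(X[has_label], Y[has_label])
    return gate, multi

def predict_hierarchical(gate, multi, X):
    Y_hat = np.zeros((X.shape[0], len(multi.estimators_)), dtype=int)
    is_b = gate.predict(X).astype(bool)              # stage 1: any labels at all?
    if is_b.any():
        Y_hat[is_b] = multi.predict(X[is_b])         # stage 2: only for "B" samples
    return Y_hat                                     # all-zero rows for "A" samples
```

By construction, a sample gated to "A" gets an all-zero label vector, which is exactly the behavior the single-model "NONE" approach struggles to guarantee.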
Other common solutions combine a base multi-class classifier (e.g., k-NN or SVM) with one of these three multi-label methods: Binary Relevance, Classifier Chain, or Label Powerset. Scikit-learn implements Binary Relevance (as MultiOutputClassifier) and Classifier Chain; Label Powerset is available in the scikit-multilearn package. I suggest Classifier Chain, which takes dependencies among labels into account, since it seems from your question that you want the algorithm to predict well when there are no labels at all. Label Powerset is also a good choice, unless you have a "lot" of labels ($n \geq 20$) and not enough data.
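For completeness, here is a minimal Classifier Chain sketch with scikit-learn on synthetic data; the base estimator and all parameter values are illustrative choices, not the only ones that work.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label data; allow_unlabeled=True keeps some all-zero rows.
X, Y = make_multilabel_classification(n_samples=500, n_classes=5,
                                      allow_unlabeled=True, random_state=0)

# Each classifier in the chain also sees the predictions of the earlier
# ones, which is how dependencies among labels are taken into account.
chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=0)
chain.fit(X, Y)
print(chain.predict(X[:5]))   # rows of 0/1 indicators, one column per label
```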
Correct answer by lhsmello on January 2, 2021