
target variable prediction among possible answers

Data Science Asked by davide m. on January 27, 2021

I have a dataset on which I would like to apply a machine learning algorithm for multi-class classification. Here is an example of my target variable (in string format; it will later be one-hot encoded):

+-------------------+
|      target       |
+-------------------+
| (Mike, Michael)   |
| (Anne, Annabelle) |
| (Joe, Joseph)     |
| (Mark, )          |
| (Aaron, )         |
+-------------------+

I want to have a model that, for example, predicts Joseph but does not consider it wrong because it is one of the two possible answers.
Is this a multi-class or a multi-label problem? Can the two definitions apply at the same time, or is it one or the other? I am a bit confused, though I lean towards the latter.

If my guess is correct, would it be sufficient to pick one of the models that support multi-label classification, then call fit() and score() on it?

One Answer

"I want to have a model that, for example, predicts Joseph but does not consider it wrong because it is one of the two possible answers."

It seems that you are confusing prediction and evaluation: whether an answer is correct or not is a matter of evaluation, and the evaluation method doesn't have to follow the same rules as the predicting system.

For example, the system can be designed to always predict a single name, and the evaluation measure can be based on whether the predicted name belongs to a set of correct answers. This would technically be simple multiclass classification, since a single class is predicted. However, you would have to implement your own evaluation method, because the standard one counts a prediction as correct only if it exactly matches the true class.
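A custom evaluation of this kind might look like the following minimal sketch. The function name `set_membership_accuracy` and the sample predictions are hypothetical, chosen only for illustration; the acceptable answers mirror the target column in the question.

```python
def set_membership_accuracy(predictions, acceptable_answers):
    """Fraction of single-name predictions that fall in the acceptable set.

    Hypothetical helper: `predictions` is a list of names, and
    `acceptable_answers` is a list of sets, one set per example.
    """
    if len(predictions) != len(acceptable_answers):
        raise ValueError("predictions and acceptable_answers must have the same length")
    if not predictions:
        return 0.0
    # A prediction counts as correct if it is any one of the acceptable names.
    hits = sum(pred in answers for pred, answers in zip(predictions, acceptable_answers))
    return hits / len(predictions)


# Acceptable answers per example, taken from the question's target column.
acceptable = [
    {"Mike", "Michael"},
    {"Anne", "Annabelle"},
    {"Joe", "Joseph"},
    {"Mark"},
    {"Aaron"},
]

# Made-up model output: one predicted name per example.
predicted = ["Michael", "Anne", "Joseph", "Aaron", "Aaron"]

score = set_membership_accuracy(predicted, acceptable)
print(score)  # 4 of the 5 predictions are acceptable
```

Here "Joseph" is counted as correct for the third example even though "Joe" would have been accepted as well, which is the behaviour the question asks for.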

You could also decide that the system can predict any combination of names; that would be multi-label classification. Here again you are free to choose a specific evaluation method: for instance, count a prediction as correct as long as it contains at least one of the true answers, or only if it contains nothing but names that belong to the true answer, etc.
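The two evaluation rules mentioned for the multi-label setting could be sketched as follows. All function names and the sample predictions are hypothetical, and treating an empty prediction as incorrect under the stricter rule is an assumption made here, not something the answer specifies.

```python
def contains_any_true(predicted, true):
    """Lenient rule: at least one predicted name is a true answer."""
    return len(set(predicted) & set(true)) > 0


def only_true_names(predicted, true):
    """Stricter rule: every predicted name is a true answer.

    Assumption: an empty prediction is counted as incorrect.
    """
    predicted = set(predicted)
    return bool(predicted) and predicted <= set(true)


def rule_accuracy(rule, predictions, truths):
    """Fraction of examples for which `rule` accepts the prediction."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not predictions:
        return 0.0
    return sum(rule(p, t) for p, t in zip(predictions, truths)) / len(predictions)


truths = [{"Mike", "Michael"}, {"Anne", "Annabelle"}, {"Joe", "Joseph"}, {"Mark"}]

# Made-up multi-label output: a set of names per example.
predictions = [{"Mike"}, {"Anne", "Joe"}, {"Joseph", "Joe"}, set()]

lenient = rule_accuracy(contains_any_true, predictions, truths)
strict = rule_accuracy(only_true_names, predictions, truths)
print(lenient, strict)
```

The second prediction, {"Anne", "Joe"}, shows the difference: it passes the lenient rule because it contains "Anne", but fails the stricter one because "Joe" is not a true answer for that example.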

Note that in theory any multi-label problem can be transformed into a multiclass problem simply by treating each possible set of classes as a single class (e.g. "Mike+Michael" as one class). In practice the limiting factor is the number of resulting classes: with $N$ labels, if every possible combination can occur in the data, there would be $2^N$ classes. I don't think this transformation is relevant in your case, since it would defeat the idea of a flexible evaluation method.
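For illustration only, the transformation of label sets into single classes could be sketched like this. The helper `to_powerset_classes` is hypothetical, and the names within each set are sorted purely to make the combined class name deterministic, so "Mike+Michael" appears as "Michael+Mike".

```python
def to_powerset_classes(label_sets):
    """Map each set of labels to one combined class name, e.g. 'Michael+Mike'."""
    return ["+".join(sorted(labels)) for labels in label_sets]


# Label sets taken from the question's target column.
targets = [
    {"Mike", "Michael"},
    {"Anne", "Annabelle"},
    {"Joe", "Joseph"},
    {"Mark"},
    {"Aaron"},
]

classes = to_powerset_classes(targets)

# N distinct labels give at most 2^N possible label sets (including the empty set).
n_labels = len(set().union(*targets))
upper_bound = 2 ** n_labels

print(classes)
print(n_labels, upper_bound)
```

In this small example only 5 combined classes actually occur, far fewer than the theoretical bound, but the bound grows exponentially with the number of labels.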

Answered by Erwan on January 27, 2021
