TransWikia.com

Is there any reason not to use classification instead of regression when looking for ranges?

Data Science Asked by etiennedm on January 16, 2021

If I only want to predict ranges of a continuous value, is there any reason to use regression instead of classification? Could it depend on the type of model I am using (neural network, decision tree, Bayesian, …)?

Example

Let's say I have a dataset of images. Each image shows one person and is labeled with their height. I am only interested in predicting height ranges, for instance these four classes [ A, B, C, D ] = [ <150, 150-170, 170-190, >190 ] (in cm). Is there any reason why one of the following two approaches should lead to better performance?

  • case 1: using regression – First create and fit a model that predicts the exact height from an image, then simply map the predicted height to its height range.
  • case 2: using classification – First label all the images with the wanted ranges (= classes), then create and fit a classifier to predict the height range directly.
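
The difference between the two cases is only where the binning happens: after prediction (case 1) or before training (case 2). Here is a minimal sketch of that binning step, with no model involved. The helper name `height_to_class` and all height values are made up for illustration. The question does not say which class a boundary value such as 170 belongs to, so the sketch assumes half-open ranges where a boundary value goes to the upper class.

```python
from bisect import bisect_right

# Class boundaries in cm, from the example: <150, 150-170, 170-190, >190
BOUNDARIES = [150, 170, 190]
LABELS = ["A", "B", "C", "D"]


def height_to_class(height_cm):
    """Map a height in cm to its range label.

    Assumption: a value exactly on a boundary goes to the upper class.
    """
    return LABELS[bisect_right(BOUNDARIES, height_cm)]


# Case 1: a regressor outputs heights; bin the predictions afterwards.
predicted_heights = [148.2, 169.9, 171.0, 189.0, 193.5]  # made-up values
case1_classes = [height_to_class(h) for h in predicted_heights]

# Case 2: bin the true labels first; a classifier is trained on these targets.
true_heights = [149.0, 172.0, 168.0, 191.0, 195.0]  # made-up values
case2_targets = [height_to_class(h) for h in true_heights]

print(case1_classes)
print(case2_targets)
```

In case 1 the model still sees the exact heights during training, while in case 2 that information is discarded before the model is fit.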

Note: I am wondering if there is a general answer to this question, not only one for this example.

EDIT

As @n1tk pointed out, the post "Performance of CNN based deep models with number of classes" answers the question of what happens as the number of classes increases. My question is about regression vs classification: fitting a continuous value vs fitting ranges of that value.

One Answer

The general answer is that it depends on how the model will be used. Either approach may be optimal for a given case.

For example, if the model groups applicants into good and bad credit risks, it might be fine to say that a model score > x means good risk and a model score <= x means bad risk. But maybe different actions will be taken depending on the model score, such as offering a different interest rate or a bigger loan.

In the original example, with regression, if actual = 191 and predicted = 189, you can calculate the loss, i.e. how far off the prediction is.

In classification, if actual = 191 and P(>190)=0.35, P(170-190)=0.40, P(150-170)=0.25, then you only know that the predicted class is wrong, not by how much. Is that enough for the model usage?

There is also the assumption that a "closer" class will be chosen, but that might not be true, e.g. actual=191, P(>190)=0.25, P(170-190)=0.25, P(150-170)=0.5. The regression could also come up with 160, but you can measure that loss if the model usage requires it. Many classification algorithms have no notion of whether classes are "close". Looking at a confusion matrix, the natural question is "how close am I to the diagonal?". Is there such a metric?
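
One possible way to put a number on "how close am I to the diagonal?" is to treat the classes as ordered indices and measure how many classes away each prediction lands. The sketch below is only an illustration of that idea, not an established named metric. The function names are hypothetical, the confusion matrix is made up, and P(<150) is taken to be 0 because the probabilities given above already sum to 1.

```python
# Classes in increasing height order: A (<150), B (150-170), C (170-190), D (>190)
LABELS = ["A", "B", "C", "D"]


def predicted_index(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=lambda i: probs[i])


def ordinal_distance(true_idx, probs):
    """Number of classes between the predicted class and the true class."""
    return abs(predicted_index(probs) - true_idx)


def mean_distance_from_diagonal(confusion):
    """Average |row - column| over all samples of a confusion matrix
    (rows = true class, columns = predicted class)."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        abs(i - j) * count
        for i, row in enumerate(confusion)
        for j, count in enumerate(row)
    )
    return weighted / total


true_idx = 3  # actual = 191 cm falls in class D
near_miss = [0.0, 0.25, 0.40, 0.35]  # first set of probabilities above
far_miss = [0.0, 0.50, 0.25, 0.25]   # second set of probabilities above

print(ordinal_distance(true_idx, near_miss))  # 1
print(ordinal_distance(true_idx, far_miss))   # 2

# Made-up confusion matrix, for illustration only
confusion = [
    [3, 1, 0, 0],
    [0, 4, 1, 0],
    [0, 1, 3, 1],
    [0, 1, 1, 4],
]
print(mean_distance_from_diagonal(confusion))
```

Plain accuracy counts both predictions as equally wrong, while the ordinal distance separates the near miss from the far miss.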

You can also look at ordinal regression (https://en.wikipedia.org/wiki/Ordinal_regression), which applies when there is an implicit ranking among the classes.
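
To give a feel for how the ranking can be exposed to a model, here is a sketch of one common reduction of ordinal regression to binary problems: each ordered class is encoded as a set of yes/no answers to "is the class above threshold k?". This is only one of several ordinal approaches and is an assumption of this sketch, not something the linked article prescribes. The function names and the probability values are hypothetical.

```python
NUM_CLASSES = 4  # A, B, C, D in increasing height order


def encode_ordinal(class_idx, num_classes=NUM_CLASSES):
    """Encode an ordered class as binary answers to 'is the class above k?'."""
    return [1 if class_idx > k else 0 for k in range(num_classes - 1)]


def decode_ordinal(threshold_probs, cutoff=0.5):
    """Recover a class index by counting the thresholds judged to be exceeded."""
    return sum(1 for p in threshold_probs if p > cutoff)


# Targets for classes A..D: neighbouring classes differ in only one position
for idx in range(NUM_CLASSES):
    print(idx, encode_ordinal(idx))

# Made-up per-threshold probabilities from some hypothetical binary models
print(decode_ordinal([0.9, 0.8, 0.3]))  # 2, i.e. class C
```

With this encoding, confusing D with C gets one binary target wrong, while confusing D with A gets three wrong, so the training signal reflects how far apart the classes are.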

Choose based on how the model will be used. It is always important to know the usage and the problem being solved, then work backwards to the model.

Hope that helps.

Answered by Craig on January 16, 2021
