Is there a fundamental difference from creating a model for each value in a category?

Question

I am creating a few models based on service requests. The services being requested are not distributed equally: some are used sparingly, whereas others are quite common.
I had these services as categorical variables and built pipelines to incorporate them through one-hot encoding. I got to thinking that it may make more sense to train a model per service (at least for the common ones). Or does it make more sense to lump the less common ones into a special category?
I am struggling with the regression model, which is coming in at an R² of 0.41.

Erwan · Accepted Answer

Yes, there is.
If a model is trained for each specific value of a variable (a category), then only the subset of data for that category can be used to train and test the model, so each model has fewer instances to train on. This has two consequences:

- In the case of a small category, there might not be enough instances to obtain a reliable model.
- Every model is independent. This can be good or bad, depending on whether (and to what extent) this independence also holds in the data:
  - If the features behave in a completely different way depending on the category, then it's better to create individual models, since each can exploit the patterns specific to its category.
  - If the features behave very similarly across the categories, then independent models per category could lose a lot of information.
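
To make the independence point concrete, here is a minimal sketch on synthetic data (not the asker's data). It compares a pooled linear model with a shared slope and one intercept per category, which is what one-hot encoding the category gives a plain linear regressor, against a separate line fitted per category. The category names, slopes, noise level and the helpers `fit_line`, `r2` and `compare` are all made up for illustration, and R² is measured in-sample, so this only shows how much signal the shared-slope model cannot express, not how either approach generalizes.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r2(ys, preds):
    """Coefficient of determination of predictions against targets."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def compare(true_slopes, n=200, seed=0):
    """Return (r2_pooled, r2_separate) on synthetic data.

    true_slopes maps a hypothetical category name to the slope used
    to generate y = slope * x + noise for that category.
    """
    rng = random.Random(seed)
    data = {}
    for cat, slope in true_slopes.items():
        xs = [rng.uniform(-5, 5) for _ in range(n)]
        ys = [slope * x + rng.gauss(0, 1) for x in xs]
        data[cat] = (xs, ys)

    # Separate models: each category gets its own intercept and slope.
    separate = {cat: fit_line(xs, ys) for cat, (xs, ys) in data.items()}

    # Pooled model: one slope shared by all categories plus one intercept
    # per category (equivalent to OLS on x and one-hot category dummies).
    means = {cat: (sum(xs) / n, sum(ys) / n) for cat, (xs, ys) in data.items()}
    num = den = 0.0
    for cat, (xs, ys) in data.items():
        mx, my = means[cat]
        num += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den += sum((x - mx) ** 2 for x in xs)
    shared = num / den
    pooled = {cat: (my - shared * mx, shared) for cat, (mx, my) in means.items()}

    # Score both approaches on the same (training) data.
    targets, pred_pooled, pred_separate = [], [], []
    for cat, (xs, ys) in data.items():
        a_p, b_p = pooled[cat]
        a_s, b_s = separate[cat]
        targets += ys
        pred_pooled += [a_p + b_p * x for x in xs]
        pred_separate += [a_s + b_s * x for x in xs]
    return r2(targets, pred_pooled), r2(targets, pred_separate)


different = compare({"A": 2.0, "B": -2.0})
similar = compare({"A": 2.0, "B": 2.0})
print("different slopes: pooled R2 = %.2f, separate R2 = %.2f" % different)
print("similar slopes:   pooled R2 = %.2f, separate R2 = %.2f" % similar)
```

When the true slopes have opposite signs, the shared-slope model explains very little of the variance while the per-category fits explain most of it. When the slopes are equal, the two scores are nearly identical, so splitting the data buys nothing and each separate model simply sees fewer instances.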

In conclusion, the choice often depends on:

- How much data is available for each category.
- How strongly the behavior of the other features depends on the category.
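
Checking the amount of data per category is cheap: count the instances for each category and fold the rare ones into a catch-all level before deciding which categories, if any, deserve their own model. A minimal sketch, with made-up service names and an arbitrary threshold of 3 (both are assumptions); `lump_rare` is a hypothetical helper, not part of any library:

```python
from collections import Counter


def lump_rare(services, min_count, other_label="other"):
    """Replace every category seen fewer than min_count times by other_label."""
    counts = Counter(services)
    return [s if counts[s] >= min_count else other_label for s in services]


# Hypothetical service request categories with an uneven distribution.
services = ["pothole"] * 6 + ["graffiti"] * 4 + ["noise"] + ["tree"]

lumped = lump_rare(services, min_count=3)
print(Counter(services))  # instances available per original category
print(Counter(lumped))    # instances per category after lumping
```

In a scikit-learn pipeline, recent versions of `OneHotEncoder` offer a `min_frequency` parameter that groups infrequent categories in a similar way, which may be simpler than preprocessing the column by hand.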
