Dealing with diverse groups in regression

Question

What happens if a certain dataset contains different "groups" that follow different linear models?

For example, imagine that, examining the scatterplot of a feature $x_i$ against $y$, we see that some points follow a linear relationship with a coefficient $\beta_A<0$ while other points clearly have $\beta_B>0$. We can infer that these points belong to two different populations: population $A$ responds negatively to high values of feature $x_i$, while population $B$ responds positively. We then create a categorical feature (or one-hot encoding) to indicate which population each row belongs to.

Is splitting the dataset required, or are commonly used algorithms able to recognize that the relationship between a feature and $y$ differs across the levels of a categorical variable?
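To make the scenario concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the slopes of -2 and +2, the noise level, and the names `is_b`, `X_pooled`, `X_dummy` and `X_inter` are all hypothetical. It compares a pooled least-squares fit, a fit that only adds the group indicator, and a fit that also adds the product of the indicator and the feature (an interaction term).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, size=n)

# Hypothetical known group label: 0 = population A, 1 = population B.
is_b = rng.integers(0, 2, size=n)

# Assumed ground truth for the illustration: slope -2 for A, +2 for B.
true_slope = np.where(is_b == 1, 2.0, -2.0)
y = true_slope * x + rng.normal(0.0, 0.3, size=n)

# Pooled model: y ~ 1 + x (ignores the groups entirely).
X_pooled = np.column_stack([np.ones(n), x])
coef_pooled, *_ = np.linalg.lstsq(X_pooled, y, rcond=None)

# Indicator only: y ~ 1 + x + is_b (group-specific intercept, one shared slope).
X_dummy = np.column_stack([np.ones(n), x, is_b])
coef_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)

# Indicator plus interaction: y ~ 1 + x + is_b + x*is_b (group-specific slope).
X_inter = np.column_stack([np.ones(n), x, is_b, x * is_b])
coef_inter, *_ = np.linalg.lstsq(X_inter, y, rcond=None)

slope_a = coef_inter[1]                  # slope for population A
slope_b = coef_inter[1] + coef_inter[3]  # slope for population B

print("pooled slope:        ", coef_pooled[1])
print("shared slope (dummy):", coef_dummy[1])
print("slope A, slope B:    ", slope_a, slope_b)
```

In a linear model, the one-hot column on its own only shifts the intercept; the slope can differ between groups only if the product column is included as well.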

user2974951 · Answer

You can't really do that. There may be some factor that binds certain "groups" of data together, but there can be many reasons for such apparent grouping. Your relationship may be nonlinear, or the "groups" of data may represent subjects or objects within which observations are more strongly correlated. Unless you know for a fact that these points belong to different populations, you shouldn't label them yourself; use the data that you have to model these groupings.

Answered by user2974951 on September 17, 2020

Johannes · Answer

For the case of unobservable groups, you could use mixture models, in your case a mixture of linear regression models. Mixture models identify latent (= unobserved) clusters in the data such that all observations in a cluster share the same model parameters. The textbook example is a mixture of Gaussians, where each individual observation comes from a Normal distribution, but the mean is different for each group.
In your case, a mixture model would infer clusters of individuals that share regression coefficients and estimate the coefficients for each cluster in one step.
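As a rough illustration of that idea, the sketch below fits a two-component mixture of simple linear regressions with a hand-written EM loop in NumPy. It is not the method of any particular package: the function name `fit_mixture_of_regressions`, the random initialisation, the fixed iteration count and the synthetic data (slopes of -2 and +2, group labels hidden from the model) are all assumptions made for the example.

```python
import numpy as np

def fit_mixture_of_regressions(x, y, n_components=2, n_iter=200, seed=0):
    """Illustrative EM for a finite mixture of simple linear regressions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])  # intercept + slope design matrix
    # Start from random soft assignments (responsibilities), shape (n, K).
    resp = rng.dirichlet(np.ones(n_components), size=n)
    for _ in range(n_iter):
        # M-step: mixing weights, then weighted least squares per component.
        weights = resp.mean(axis=0)
        betas = np.empty((n_components, 2))
        sigmas = np.empty(n_components)
        for k in range(n_components):
            w = resp[:, k]
            XtW = X.T * w
            betas[k] = np.linalg.solve(XtW @ X, XtW @ y)
            resid_k = y - X @ betas[k]
            # Floor the noise scale to avoid a degenerate zero-variance component.
            sigmas[k] = max(np.sqrt((w * resid_k**2).sum() / w.sum()), 1e-6)
        # E-step: posterior probability of each component for each point
        # (Gaussian log-density up to an additive constant).
        resid = y[:, None] - X @ betas.T
        log_p = np.log(weights) - np.log(sigmas) - 0.5 * (resid / sigmas) ** 2
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return weights, betas, sigmas, resp

# Synthetic data with two hidden populations (assumed slopes -2 and +2).
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(-3, 3, size=n)
hidden_group = rng.integers(0, 2, size=n)  # never shown to the model
y = np.where(hidden_group == 1, 2.0, -2.0) * x + rng.normal(0.0, 0.3, size=n)

weights, betas, sigmas, resp = fit_mixture_of_regressions(x, y)
print("mixing weights:", weights)
print("intercepts and slopes per cluster:\n", betas)
```

Each row of `resp` holds the estimated probability that the observation belongs to each cluster, so the clustering and the per-cluster coefficients come out of the same fit. EM can stop in a local optimum, so in practice several random restarts are usually compared.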

For a basic introduction, see:
Grün, B., & Leisch, F. (2008). Finite mixtures of generalized linear regression models. In Recent Advances in Linear Models and Related Areas (pp. 205-230). Physica-Verlag HD.

Finite mixture models require the number of latent groups to be specified in advance (e.g. chosen from domain knowledge or by cross-validation). Infinite mixture models infer a suitable number of groups from the data.

These models typically do not give you clear rules as to why an individual belongs to a cluster, and consequently cannot directly be used to assign new, unseen individuals to a cluster. They could possibly be extended with a prior that explicitly models cluster probabilities based on observed data.
