Data Science Asked by Akshay Prabhakant on June 2, 2021
The problem of regression is to minimize the sum of squared errors, i.e. to minimize $\sum\limits_{i=1}^n (y_i - \hat{y}_i)^2$.
But only in linear regression could you use the expression $\hat{y}_i = \beta_0 + \beta_1 x_i$, then minimize the sum of squared errors w.r.t. $\beta_0$ and $\beta_1$ to obtain the following constraints:
$$\begin{align*} \sum\limits_{i=1}^n x_i(y_i - \hat{y}_i) &= 0 \\ \sum\limits_{i=1}^n \hat{y}_i(y_i - \hat{y}_i) &= 0 \end{align*}$$
and then these constraints are used to prove that $\sum\limits_{i=1}^n (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = 0$, which yields SST = SSE + SSR.
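Spelling out that last step, using the standard expansion of the total sum of squares:
$$\begin{aligned} \sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n \big((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\big)^2 \\ &= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^n (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) \\ &= SSE + SSR + 0. \end{aligned}$$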
So, my question is: if the regression weren't linear, would SST = SSE + SSR still hold? If yes, why? If no, why not?
Just to be clear: with linear regression it is perfectly OK to model nonlinear associations such as $y = 2x + 3x^2 + 17\log(x)$ simply by including the relevant nonlinear terms, because the model is still linear in the parameters. I guess you are aware of this, but just wanted to make sure. In those cases, SST = SSE + SSR will hold.
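As a quick numerical sanity check of that point, here is a minimal sketch assuming NumPy, with simulated data of exactly that form (the variable names are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
y = 2 * x + 3 * x**2 + 17 * np.log(x) + rng.normal(scale=5.0, size=200)

# Design matrix: intercept plus the nonlinear terms (still linear in the parameters).
X = np.column_stack([np.ones_like(x), x, x**2, np.log(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
sse = np.sum((y - y_hat) ** 2)          # residual (error) sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
print(sst, sse + ssr)  # the two numbers agree up to floating-point error
```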
Now, the crux of the matter is that SST = SSE + SSR is actually a special case that only holds when the model is linear in the parameters and fitted by least squares with an intercept. When we are dealing with a nonlinear model such as logistic regression, or any Generalised Linear Model, the situation is quite different: we model the linear predictor's association with the response variable via a link function, so a simple sum of squared deviations does not meaningfully reflect the variability in the response, because the variance of an individual response depends on its mean.
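A rough illustration of this (a sketch assuming scikit-learn and NumPy; the simulated data and variable names are purely hypothetical): for a fitted logistic regression the naive sums of squares no longer decompose.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = rng.normal(size=(300, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = rng.binomial(1, p)

fit = LogisticRegression().fit(x, y)
y_hat = fit.predict_proba(x)[:, 1]  # fitted probabilities

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
print(sst, sse + ssr)  # generally not equal: the cross-term no longer vanishes
```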
There is a very good further explanation of this on Cross Validated, by Ben, and so I will just post it here, for completeness:
The sums-of-squares in linear regression are special cases of the more general deviance values in the generalised linear model. In the more general model there is a response distribution with mean linked to a linear function of the explanatory variables (with an intercept term). The three deviance statistics in a GLM are defined as:
$$\begin{matrix} \text{Null Deviance} & & D_{TOT} = 2(\hat{\ell}_{S} - \hat{\ell}_0), \\[6pt] \text{Explained Deviance} & & D_{REG} = 2(\hat{\ell}_{p} - \hat{\ell}_0), \\[6pt] \text{Residual Deviance}^\dagger & & D_{RES} = 2(\hat{\ell}_{S} - \hat{\ell}_{p}). \end{matrix}$$
In these expressions the value $\hat{\ell}_S$ is the maximised log-likelihood under a saturated model (one parameter per data point), $\hat{\ell}_0$ is the maximised log-likelihood under a null model (intercept only), and $\hat{\ell}_{p}$ is the maximised log-likelihood under the model (intercept term and $p$ coefficients).
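As a concrete illustration, the three deviances can be computed directly for a fitted GLM. This is only a sketch assuming statsmodels, with simulated Poisson data chosen purely for the example:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.3 + 0.7 * x))

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
null_fit = sm.GLM(y, np.ones_like(y), family=sm.families.Poisson()).fit()

d_res = fit.deviance                  # residual deviance: 2*(ll_saturated - ll_model)
d_tot = fit.null_deviance             # null deviance:     2*(ll_saturated - ll_null)
d_reg = 2 * (fit.llf - null_fit.llf)  # explained deviance: 2*(ll_model - ll_null)
print(d_tot, d_reg + d_res)           # D_TOT = D_REG + D_RES
```

The equality printed on the last line holds by simple algebra, since $\hat{\ell}_S$ cancels out of the sum of $D_{REG}$ and $D_{RES}$.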
These deviance statistics play a role analogous to scaled versions of the sums-of-squares in linear regression. It is easy to see that they satisfy the decomposition $D_{TOT} = D_{REG} + D_{RES}$, which is analogous to the decomposition of the sums-of-squares in linear regression. In fact, in the case where you have a normal response distribution with an identity link function you get a linear regression model, and the deviance statistics reduce to the following:
$$\begin{aligned} D_{TOT} &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{\sigma^2} \cdot SS_{TOT}, \\[6pt] D_{REG} &= \frac{1}{\sigma^2} \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \frac{1}{\sigma^2} \cdot SS_{REG}, \\[6pt] D_{RES} &= \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{\sigma^2} \cdot SS_{RES}. \end{aligned}$$
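To see this reduction numerically, here is a quick check (again only a sketch assuming statsmodels, taking $\sigma^2 = 1$ so that the scaling factor drops out):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)

# Gaussian family with the default identity link: an ordinary linear regression.
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Gaussian()).fit()
y_hat = fit.fittedvalues

print(fit.deviance, np.sum((y - y_hat) ** 2))          # residual deviance == SS_RES
print(fit.null_deviance, np.sum((y - y.mean()) ** 2))  # null deviance     == SS_TOT
```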
Now, the coefficient of determination in a linear regression model is a goodness-of-fit statistic that measures the proportion of the total variation in the response that is attributable to the explanatory variables. A natural extension in the case of a GLM is to form the statistic:
$$R_{GLM}^2 = 1-\frac{D_{RES}}{D_{TOT}} = \frac{D_{REG}}{D_{TOT}}.$$
It is easily seen that this statistic reduces to the coefficient of determination in the special case of linear regression, since the scaling values cancel out. In the broader context of a GLM the statistic has a natural interpretation that is analogous to its interpretation in linear regression: it gives the proportion of the null deviance that is explained by the explanatory variables in the model.
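For example, with a logistic (binomial) GLM this deviance-based statistic can be computed directly from the fitted and null deviances. A minimal sketch, assuming statsmodels and simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.2 + 1.5 * x))))

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
r2_glm = 1 - fit.deviance / fit.null_deviance  # = D_REG / D_TOT
print(r2_glm)  # proportion of the null deviance explained by the explanatory variable
```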
Now that we have seen how the sums-of-squares in linear regression extend to the deviances in a GLM, we can see that the regular coefficient of determination is inappropriate in the non-linear model, since it is specific to the case of a linear model with a normally distributed error term. Nevertheless, although the standard coefficient of determination is inappropriate, it is possible to form an appropriate analogue using the deviance values, with an analogous interpretation.
$^dagger$ The residual deviance is sometimes just called the deviance.
Now as to the other part of the question:
If the answer to the above question is no, then a follow-on question I would like to pose is: why is $R^2$ the go-to measure for assessing the regression performance of a model or a technique?
My answer to this is that $R^2$ is NOT the go-to measure for assessing regression performance. On the contrary, $R^2$ can be a very poor measure of regression performance. I would recommend reading this thread on CV.
That thread currently has 10 answers, many of which are good, but please pay particular attention to the answer by whuber.
Answered by Robert Long on June 2, 2021