# Loss function for regression

Cross Validated Asked on March 3, 2021

I am studying Christopher Bishop's book *Pattern Recognition and Machine Learning*. I have come across the regression loss function before; it is usually expressed as $$\sum_{i=1}^N \{t_i - y(x_i)\}^2$$

where $$t_i$$ is the true value and $$y(x_i)$$ is the function approximating $$t_i$$.

In the book however, the regression loss is written in the form
$$E[L] = \int \int L(t,y(x))\,p(x,t)\,dx\,dt$$

The expectation is taken with respect to samples from the joint distribution $$p(x,t)$$. How do we go about thinking about the joint distribution $$p(x,t)$$?

How do we actually compute the joint distribution $$p(x,t)$$ in the regression setting? For classification, the Naive Bayes algorithm can be used to compute the distribution $$p(x,C)$$, where $$C \in \{C_1, C_2, \dots, C_k\}$$ are the classes, from the data itself by combining the likelihood and the prior. Hence, $$p(x,C_i)$$ for classification is just a scalar value.

As said in the first comment, the first version is the one to evaluate empirically on given data, whereas the second one with $$E[L]$$ is a theoretical population version for a general loss function $$L$$. In the first equation $$L(t,y(x))=(t-y(x))^2$$, but the second one can use other loss functions if desired (which then have empirical versions as well, summing up those other losses).
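To make the "empirical version of a general loss" concrete, here is a small sketch (with made-up toy values for $$t_i$$ and $$y(x_i)$$) showing that the same summation machinery applies to the squared loss and to any other loss, e.g. the absolute loss:

```python
import numpy as np

# Hypothetical toy data: true targets t_i and model predictions y(x_i).
t = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.8, 3.3, 3.9])

# Empirical squared loss: sum_i (t_i - y(x_i))^2
squared = np.sum((t - y) ** 2)

# The same empirical sum works for any loss L, e.g. absolute loss:
absolute = np.sum(np.abs(t - y))
```

Swapping in a different `L` changes only the term inside the sum, not the structure of the empirical objective.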

The given formula for $$E[L]$$ assumes both $$x$$ and $$t$$ to be random (sometimes in regression modeling $$x$$ is assumed to be fixed, but it's probably not worthwhile to go into this because it doesn't make that much of a difference for the question).

Regarding the distribution $$p(x,t)$$: obviously, for evaluating the theoretical $$E[L]$$ one needs to make some model assumptions about it; however, in empirical data the best that we have is the empirical distribution of the $$t$$ and the $$x$$. Now if we evaluate the integral in $$E[L]$$ with $$p$$ being the empirical distribution of $$(x,t)$$ on a given dataset of size $$N$$ (i.e., every observed $$(x_i,t_i)$$ appears with probability $$\frac{1}{N}$$), it is actually $$\frac{1}{N}\sum_{i=1}^N L(t_i,y(x_i))$$, and with the squared loss $$\frac{1}{N}\sum_{i=1}^N (t_i-y(x_i))^2$$ (if we consider $$N$$ as fixed, the factor $$\frac{1}{N}$$ is just a constant that doesn't matter). This connects the two formulae.

(When making model assumptions about $$p$$ in order to say something theoretical about $$E[L]$$, one would hope that they are more or less in line with the empirical distribution of a dataset to which theoretical results are applied.)
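The reduction of the double integral to a sum under the empirical distribution can be checked numerically. In this sketch (toy values, hypothetical predictions `y_pred`), the expectation of the squared loss under a distribution that puts mass $$\frac{1}{N}$$ on each observed pair equals the mean squared error:

```python
import numpy as np

# Toy dataset: observed targets t_i and hypothetical predictions y(x_i).
t = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.8, 3.3, 3.9])

N = len(t)
p = np.full(N, 1.0 / N)  # empirical distribution: each (x_i, t_i) has mass 1/N

# E[L] under the empirical distribution: sum_i p_i * L(t_i, y(x_i)),
# which for the squared loss is (1/N) * sum_i (t_i - y(x_i))^2.
expected_loss = np.sum(p * (t - y_pred) ** 2)
mse = np.mean((t - y_pred) ** 2)

assert np.isclose(expected_loss, mse)
```

Dropping the constant factor $$\frac{1}{N}$$ recovers the sum-of-squares formula from the question, so the two objectives have the same minimizer.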

Correct answer by Lewian on March 3, 2021