Cross Validated, asked on March 3, 2021
I am studying Christopher Bishop's book *Pattern Recognition and Machine Learning*. I have come across the regression loss function before; it is usually expressed as $$\sum_{i=1}^N \{t_i - y(x_i)\}^2$$
where $t_i$ represents the true value and $y(x_i)$ is the value of the function used to approximate $t_i$.
In the book, however, the regression loss is written in the form
$$E[L] = \int\!\!\int L(t, y(x))\, p(x,t)\, dx\, dt$$
The expectation is taken with respect to samples from the joint distribution $p(x,t)$. How do we go about thinking about this joint distribution?
How do we actually compute the joint distribution $p(x,t)$ in the regression setting? For classification, the Naive Bayes algorithm can be used to compute the distribution $p(x, C)$, where $C \in \{C_1, C_2, \dots, C_k\}$, from the data itself by combining the likelihood and the prior. Hence, $p(x, C_i)$ for classification is just a scalar value.
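For concreteness, here is a minimal Python sketch of that classification analogy; the priors, likelihoods, and feature vector are made up purely for illustration, not taken from any real dataset:

```python
import numpy as np

# Hypothetical fitted Naive Bayes quantities: two classes, two binary features.
# These numbers are made up; in practice they come from class/feature counts.
prior = {"C1": 0.6, "C2": 0.4}          # p(C_i)
likelihood = {                          # p(x_j = 1 | C_i) for each feature j
    "C1": np.array([0.8, 0.3]),
    "C2": np.array([0.2, 0.7]),
}

def joint(x, c):
    # Naive Bayes: p(x, C_c) = p(C_c) * prod_j p(x_j | C_c),
    # using the conditional-independence assumption across features.
    p_x_given_c = np.prod(np.where(x == 1, likelihood[c], 1.0 - likelihood[c]))
    return prior[c] * p_x_given_c

x = np.array([1, 0])                    # one observed binary feature vector
print(joint(x, "C1"), joint(x, "C2"))   # scalar values, as noted above
```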
As said in the first comment, the first version is one to evaluate empirically on given data, whereas the second one with $E[L]$ is a theoretical population version for a general loss function $L$. In the first equation $L(t, y(x)) = (t - y(x))^2$, but the second one can use other loss functions if desired (which will then have empirical versions as well, summing up the other losses).
The given formula for $E[L]$ assumes both $x$ and $t$ to be random (sometimes in regression modeling $x$ is assumed to be fixed, but it's probably not worthwhile to go into this because it doesn't make much of a difference for the question). Regarding the distribution $p(x,t)$: obviously, to evaluate the theoretical $E[L]$ one needs to make some model assumptions about it, whereas with empirical data the best we have is the empirical distribution of the $(x, t)$ pairs.

Now if we evaluate the integral in $E[L]$ with $p$ being the empirical distribution of $(x,t)$ on a given dataset of size $N$ (i.e., every observed $(x_i, t_i)$ appears with probability $\frac{1}{N}$), it is actually $\frac{1}{N}\sum_{i=1}^N L(t_i, y(x_i))$, and with the squared loss $\frac{1}{N}\sum_{i=1}^N (t_i - y(x_i))^2$ (if we consider $N$ as fixed, the factor $\frac{1}{N}$ is just a constant that doesn't matter). This connects the two formulae. (When making model assumptions about $p$ in order to say something theoretical about $E[L]$, one would hope that they are more or less in line with the empirical distribution of a dataset to which the theoretical results are applied.)
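To make that connection concrete, here is a minimal Python sketch (with synthetic data and a hypothetical linear predictor, both chosen only for illustration) showing that $E[L]$ evaluated under the empirical distribution, where each point carries probability $\frac{1}{N}$, is exactly the average of the per-point squared losses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: N draws of (x_i, t_i) standing in for samples from p(x, t).
N = 1000
x = rng.normal(size=N)
t = 2.0 * x + rng.normal(scale=0.5, size=N)

def y(x):
    # A hypothetical fitted regression function; any predictor works here.
    return 2.0 * x

# E[L] under the empirical distribution: each (x_i, t_i) has probability 1/N...
weights = np.full(N, 1.0 / N)
expected_loss = np.sum(weights * (t - y(x)) ** 2)

# ...which is the same as the familiar mean of squared errors.
mean_squared_error = np.mean((t - y(x)) ** 2)

print(expected_loss, mean_squared_error)
```

Running this prints two identical numbers (up to floating-point rounding), since the integral against the empirical distribution and the average of squared errors coincide by construction.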
Correct answer by Lewian on March 3, 2021