Cross Validated Asked on March 3, 2021

I am studying Christopher Bishop's book "Pattern Recognition and Machine Learning". I have come across the regression loss function before; usually it is expressed as $$\sum_{i=1}^N \{t_i - y(x_i)\}^2$$

where $t_i$ represents the true value and $y(x_i)$ is the function approximating $t_i$.

In the book however, the regression loss is written in the form

$$E[L] = \int \int L(t, y(x))\, p(x,t)\, dx\, dt$$

The expectation is taken with respect to the joint distribution $p(x,t)$. How do we go about thinking about this joint distribution?

How do we actually compute the joint distribution $p(x,t)$ in the regression setting? For classification, the Naive Bayes algorithm can be used to compute the distribution $p(x,C)$, where $C \in \{C_1, C_2, \dots, C_k\}$, from the data itself by combining the likelihood and the prior. Hence, $p(x,C_i)$ for classification is just a scalar value.
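To illustrate the classification case described above, here is a minimal sketch of how the joint $p(x, C_i)$ reduces to a scalar: it is the class prior times the class-conditional likelihood. The Gaussian class-conditionals, the prior values, and the helper names (`joint`, `gaussian_pdf`) are all hypothetical, chosen only for the example.

```python
import numpy as np

# Hypothetical two-class setup: p(C_i) priors and Gaussian
# class-conditional densities p(x | C_i) with a shared sigma.
priors = {"C1": 0.6, "C2": 0.4}
means = {"C1": 0.0, "C2": 2.0}
sigma = 1.0

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate normal distribution at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def joint(x, c):
    """p(x, C_c) = p(x | C_c) * p(C_c) -- a single scalar value."""
    return gaussian_pdf(x, means[c], sigma) * priors[c]

print(joint(0.5, "C1"))  # scalar joint density at x = 0.5 for class C1
```

For a given $x$, comparing `joint(x, "C1")` with `joint(x, "C2")` is exactly the Naive Bayes classification rule, since the normalizing $p(x)$ is common to both.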

As said in the first comment, the first version is one to evaluate empirically on given data, whereas the second one with $E[L]$ is a theoretical population version for a general loss function $L$. In the first equation $L(t,y(x)) = (t-y(x))^2$, but the second one can use other loss functions if desired (which then have empirical versions as well, obtained by summing the corresponding losses).

The given formula for $E[L]$ assumes both $x$ and $t$ to be random (sometimes in regression modelling $x$ is assumed to be fixed, but it's probably not worthwhile to go into this, because it doesn't make much of a difference for the question).

Regarding the distribution $p(x,t)$: obviously, to evaluate the theoretical $E[L]$ one needs to make some model assumptions about it, whereas with empirical data the best we have is the empirical distribution of the $(x_i, t_i)$. Now if we evaluate the integral in $E[L]$ with $p$ being the empirical distribution of $(x,t)$ on a given dataset of size $N$ (i.e., every observed pair $(x_i,t_i)$ appears with probability $\frac{1}{N}$), it is actually $\frac{1}{N}\sum_{i=1}^N L(t_i, y(x_i))$, and with the squared loss $\frac{1}{N}\sum_{i=1}^N (t_i - y(x_i))^2$ (if we consider $N$ as fixed, the factor $\frac{1}{N}$ is just a constant that doesn't matter). This connects the two formulae.

(When making model assumptions about $p$ in order to say something theoretical about $E[L]$, one would hope that they are more or less in line with the empirical distribution of a dataset to which the theoretical results are applied.)
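The connection above can be checked numerically: plugging the empirical distribution (probability $\frac{1}{N}$ on each observed pair) into the integral for $E[L]$ gives exactly the mean squared error. The dataset and the regression function $y(x)$ below are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy dataset of N pairs (x_i, t_i) and a candidate y(x).
rng = np.random.default_rng(0)
N = 100
x = rng.uniform(0.0, 1.0, N)
t = 2.0 * x + rng.normal(0.0, 0.1, N)  # noisy targets

def y(x):
    return 2.0 * x  # some regression function y(x)

# E[L] with p equal to the empirical distribution: every observed pair
# (x_i, t_i) carries probability 1/N, so the double integral over (x, t)
# collapses to a probability-weighted sum of L(t_i, y(x_i)).
p_i = np.full(N, 1.0 / N)
expected_loss = np.sum(p_i * (t - y(x)) ** 2)

# The empirical mean squared error, computed directly.
mse = np.mean((t - y(x)) ** 2)

print(expected_loss, mse)  # the two values coincide
```

Dropping the $\frac{1}{N}$ factor recovers the unnormalized sum $\sum_{i=1}^N (t_i - y(x_i))^2$ from the first formula, which has the same minimizer for fixed $N$.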

Correct answer by Lewian on March 3, 2021
