Cross Validated Asked on November 2, 2021
I usually see these loss functions discussed in the context of specific types of problems: cross-entropy loss for classification and mean squared error ($L_2$ loss) for regression.
However, my understanding (see here) is that doing maximum likelihood estimation (MLE) is equivalent to minimizing the negative log-likelihood (NLL), which in turn is equivalent to minimizing the KL divergence and thus the cross-entropy.
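To spell out that last step: for a fixed data distribution $p^*$,
$$D_{KL}(p^* \Vert p_\theta) = E_{x \sim p^*}[\log p^*(x)] - E_{x \sim p^*}[\log p_\theta(x)] = -H(p^*) + H(p^* \Vert p_\theta),$$
and since the entropy $H(p^*)$ does not depend on $\theta$, minimizing the KL divergence over $\theta$ is the same as minimizing the cross-entropy, whose sample average is the NLL.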
So: is minimizing the mean squared error in a regression problem also equivalent to minimizing a cross-entropy?
In a regression problem you have pairs $(x_i, y_i)$ and some true model $q$ that characterizes $q(y\mid x)$. Let's say you assume that your model density is
$$f_\theta(y\mid x)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right\}$$
and that you fix $\sigma^2$ to some value.
The mean $\mu_\theta(x_i)$ is then modelled via, e.g., a neural network (or any other model).
Writing the empirical approximation to the cross-entropy, you get:
$$\sum_{i = 1}^n-\log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right\} \right)$$
$$=\sum_{i = 1}^n-\log\left( \frac{1}{\sqrt{2\pi\sigma^2}}\right) +\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2$$
If we, e.g., set $\sigma^2 = 1$ (i.e. assume we know the variance; we could also model the variance, in which case our neural network would have two outputs, one for the mean and one for the variance), we get:
$$=\sum_{i = 1}^n-\log\left( \frac{1}{\sqrt{2\pi}}\right) +\frac{1}{2}(y_i-\mu_\theta(x_i))^2$$
Minimizing this is equivalent to minimizing the $L_2$ loss.
So we have seen that minimizing the cross-entropy under the assumption of normality is equivalent to minimizing the $L_2$ loss.
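As a quick numerical sanity check, here is a minimal sketch (toy data and a linear model $\mu_\theta(x) = \theta x$ standing in for the neural network, purely for illustration) showing that with $\sigma^2 = 1$ the Gaussian NLL and the $L_2$ objective differ only by a constant and therefore share the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data, purely illustrative.
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

def gaussian_nll(mu, y, sigma2=1.0):
    """Summed negative log-likelihood of y under N(mu, sigma2)."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma2) + 0.5 * (y - mu) ** 2 / sigma2)

def half_sse(mu, y):
    """Half the sum of squared errors (the L2 objective up to the factor 1/2)."""
    return 0.5 * np.sum((y - mu) ** 2)

# Evaluate both objectives over candidate parameters for mu_theta(x) = theta * x.
thetas = np.linspace(0.0, 4.0, 401)
nll = np.array([gaussian_nll(t * x, y) for t in thetas])
sse = np.array([half_sse(t * x, y) for t in thetas])

# With sigma^2 = 1 the two differ only by the constant n * 0.5 * log(2*pi),
# so they are minimized by the same theta.
print(np.allclose(nll - sse, len(y) * 0.5 * np.log(2 * np.pi)))  # True
print(thetas[np.argmin(nll)], thetas[np.argmin(sse)])            # identical
```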
Answered by Sebastian on November 2, 2021
The mean squared error is, up to additive and multiplicative constants, the cross-entropy between the data distribution $p^*(x)$ and your Gaussian model distribution $p_\theta$. Note that the standard MLE procedure is:
$$\begin{align} \max_{\theta} E_{x \sim p^*}[\log p_{\theta}(x)] &= \min_{\theta} \left(- E_{x \sim p^*}[\log p_{\theta}(x)]\right)\\ &= \min_{\theta} H(p^* \Vert p_{\theta}) \\ &\approx \min_{\theta} \sum_i \frac{1}{2} \left(\Vert x_i - \theta_1\Vert^2/\theta_2^2 + \log 2 \pi \theta_2^2\right) \end{align}$$
where $H(p^* \Vert p_{\theta})$ denotes the cross-entropy and the last line uses a Monte Carlo approximation to the expectation. And as you stated, this is equivalent to minimizing the KL divergence between the data distribution and your model distribution. Commonly the variance $\theta_2^2$ is fixed, in which case it only contributes constants that drop out of the objective.
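As a small illustration of that Monte Carlo step (a sketch with made-up Gaussian parameters, purely for illustration), the sample average of $-\log p_\theta(x_i)$ over draws from $p^*$ approaches the closed-form cross-entropy between the two Gaussians, and once $\theta_2$ is held fixed the only $\theta_1$-dependent term left is the squared error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Data distribution p* = N(mu_star, sigma_star^2); Gaussian model p_theta = N(theta1, theta2^2).
mu_star, sigma_star = 0.5, 1.2
theta1, theta2 = 0.0, 1.0

# Monte Carlo estimate of H(p* || p_theta) = -E_{x ~ p*}[log p_theta(x)].
samples = rng.normal(mu_star, sigma_star, size=200_000)
nll_per_sample = 0.5 * np.log(2 * np.pi * theta2**2) + 0.5 * (samples - theta1) ** 2 / theta2**2
ce_monte_carlo = nll_per_sample.mean()

# Closed-form cross-entropy between two univariate Gaussians, for comparison.
ce_exact = 0.5 * np.log(2 * np.pi * theta2**2) + (sigma_star**2 + (mu_star - theta1) ** 2) / (2 * theta2**2)

print(ce_monte_carlo, ce_exact)  # the two agree up to Monte Carlo error

# With theta2 held fixed, the log term is a constant, so minimizing the
# Monte Carlo estimate over theta1 is exactly minimizing the mean squared error.
```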
Some people get confused because certain textbooks introduce the cross-entropy only for the Bernoulli/categorical distribution (almost all machine learning libraries are guilty of this!), but it applies more generally than the discrete setting.
Answered by Eweler on November 2, 2021