
In softmax classifier, why use exp function to do normalization?

Data Science Asked on February 24, 2021

Why use softmax as opposed to standard normalization? In the comments under the top answer to that question, @Kilian Batzner raised two questions which also confuse me a lot. It seems no one gives an explanation except for the numerical benefits.

I get the reasons for using Cross-Entropy Loss, but how does that relate to the softmax? You said "the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth". Suppose I used standard/linear normalization, but still used the Cross-Entropy Loss. Then I would also try to minimize the cross-entropy. So how is the softmax linked to the cross-entropy, except for the numerical benefits?

As for the probabilistic view: what is the motivation for looking at log probabilities? The reasoning seems to be a bit like "We use $e^x$ in the softmax because we interpret $x$ as log-probabilities". With the same reasoning we could say we use $e^{e^{e^x}}$ in the softmax because we interpret $x$ as log-log-log-probabilities (exaggerating here, of course). I get the numerical benefits of softmax, but what is the theoretical motivation for using it?

5 Answers

It is more than just numerical. A quick reminder of the softmax: $$ P(y=j \mid x) = \frac{e^{x_j}}{\sum_{k=1}^K e^{x_k}} $$

Where $x$ is an input vector with length equal to the number of classes $K$. The softmax function has 3 very nice properties: 1. it normalizes your data (outputs a proper probability distribution), 2. is differentiable, and 3. it uses the exp you mentioned. A few important points:

  1. The loss function is not directly related to softmax. You can use standard normalization and still use cross-entropy.

  2. A "hardmax" function (i.e. argmax) is not differentiable. The softmax gives at least a minimal amount of probability to all elements in the output vector, and so is nicely differentiable, hence the term "soft" in softmax.

  3. Now I get to your question. The $e$ in softmax is the natural exponential function. Before we normalize, we transform $x$ as in the graph of $e^x$:

[Figure: graph of the natural exponential function $y = e^x$]

If $x$ is 0 then $y=1$, if $x$ is 1 then $y\approx 2.7$, and if $x$ is 2, $y$ is already about $7.4$. A huge step! This is what's called a non-linear transformation of our unnormalized log scores. The interesting property of the exponential function combined with the normalization in the softmax is that high scores in $x$ become much more probable than low scores.

An example. Say $K=4$, and your log score $x$ is vector $[2, 4, 2, 1]$. The simple argmax function outputs:

$$ [0, 1, 0, 0] $$

The argmax is the goal, but it's not differentiable and we can't train our model with it :( A simple normalization, which is differentiable, outputs the following probabilities:

$$ [0.2222, 0.4444, 0.2222, 0.1111] $$

That's really far from the argmax! :( Whereas the softmax outputs: $$ [0.1025, 0.7573, 0.1025, 0.0377] $$

That's much closer to the argmax! Because we use the natural exponential, we hugely increase the probability of the biggest score and decrease the probability of the lower scores when compared with standard normalization. Hence the "max" in softmax.
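To make the comparison concrete, here is a minimal NumPy sketch reproducing the three outputs above (the function names are purely illustrative):

```python
# Compare argmax, simple normalization, and softmax for the example scores.
import numpy as np

x = np.array([2.0, 4.0, 2.0, 1.0])   # the unnormalized log scores

def simple_norm(x):
    # "standard" normalization: divide each score by the sum of all scores
    return x / x.sum()

def softmax(x):
    # exponentiate first, then normalize
    e = np.exp(x)
    return e / e.sum()

print(np.eye(len(x))[np.argmax(x)])   # argmax:      [0. 1. 0. 0.]
print(simple_norm(x))                 # simple norm: approx. [0.2222 0.4444 0.2222 0.1111]
print(softmax(x))                     # softmax:     approx. [0.1025 0.7573 0.1025 0.0377]
```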

Correct answer by vega on February 24, 2021

In addition to vega's explanation,

let's define a generic softmax: $$P(y=j \mid x) = \frac{\psi^{x_j}}{\sum_{k=1}^K \psi^{x_k}}$$ where $\psi$ is a constant $\geq 1$.

If $\psi=1$, then every score maps to $1$ and you get the uniform distribution $1/K$, which is even farther from the argmax than the simple normalization @vega mentioned.

Let's now assume $\psi=100$: you are now pretty close to the argmax, but you also get really small numbers for negative scores and huge numbers for positive ones, which easily overflow floating-point arithmetic (for example, the maximum value of NumPy's float64 is about $10^{308}$). On top of that, even with the choice $\psi=e$, which is much smaller than $100$, frameworks have to implement a more stable version of the softmax (multiplying both numerator and denominator by a constant $C$, typically $C=e^{-\max_k x_k}$, i.e. subtracting the maximum score), because the results otherwise become too large or too small to represent at that precision.

So you want to pick a constant big enough to approximate the argmax well, and small enough that the big and small numbers appearing in the calculation stay representable.

And of course, $e$ also has a pretty nice derivative ($\frac{d}{dx} e^x = e^x$).
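A minimal sketch of both points, assuming NumPy (`psi_softmax` and `stable_softmax` are illustrative names):

```python
# Softmax with a generic base psi, and the numerically stable variant.
import numpy as np

def psi_softmax(x, psi):
    p = psi ** x
    return p / p.sum()

def stable_softmax(x):
    # Subtracting max(x) multiplies numerator and denominator by the same
    # constant C = exp(-max(x)), so the probabilities are unchanged, but the
    # largest exponent is now 0 and cannot overflow.
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([2.0, 4.0, 2.0, 1.0])
print(psi_softmax(x, 1.0))     # psi = 1:   uniform [0.25 0.25 0.25 0.25]
print(psi_softmax(x, 100.0))   # psi = 100: very close to the argmax
print(psi_softmax(np.array([200.0, 400.0]), 100.0))  # overflows float64 -> [nan nan]
print(stable_softmax(np.array([200.0, 400.0])))      # fine: approx. [0. 1.]
```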

Answered by komunistbakkal on February 24, 2021

This question is very interesting. I do not know the exact reason but I think the following reason could be used to explain the usage of the exponential function. This post is inspired by statistical mechanics and the principle of maximum entropy.

I will explain this by using an example with $N$ images, which are constituted by $n_1$ images from the class $\mathcal{C}_1$, $n_2$ images from the class $\mathcal{C}_2$, ..., and $n_K$ images from the class $\mathcal{C}_K$. Then we assume that our neural network was able to apply a nonlinear transform on our images, such that we can assign an 'energy level' $E_k$ to all the classes. We assume that this energy is on a nonlinear scale which allows us to linearly separate the images.

The mean energy $\bar{E}$ is related to the other energies $E_k$ by the following relationship
\begin{equation} N\bar{E} = \sum_{k=1}^{K} n_k E_k. \qquad (*) \end{equation}

At the same time, we see that the total number of images is given by the sum

\begin{equation} N = \sum_{k=1}^{K} n_k. \qquad (**) \end{equation}

The main idea of the maximum entropy principle is that the images are distributed over the classes in such a way that the number of possible arrangements consistent with the given energy distribution is maximized. Put more simply, the system is very unlikely to end up in a state in which all images belong to class $\mathcal{C}_1$, and it is also unlikely to end up in a state with exactly the same number of images in each class. But why is this so? If all the images were in one class, the system would have very low entropy. The second case is also a very unnatural situation: it is more likely that we will have many images with moderate energy and fewer images with very high or very low energy.

The entropy increases with the number of combinations in which we can split the $N$ images into the $n_1$, $n_2$, ..., $n_K$ image classes with corresponding energy. This number of combinations is given by the multinomial coefficient

\begin{equation} \binom{N}{n_1, n_2, \ldots, n_K} = \dfrac{N!}{\prod_{k=1}^K n_k!}. \end{equation}

We will try to maximize this number, assuming that we have infinitely many images, $N\to \infty$. But this maximization also has the equality constraints $(*)$ and $(**)$; this type of problem is called constrained optimization, and we can solve it analytically by using the method of Lagrange multipliers. Since the logarithm is monotonic, maximizing the multinomial coefficient is equivalent to maximizing its logarithm. We introduce the Lagrange multipliers $\beta$ and $\alpha$ for the equality constraints and obtain the Lagrange function $\mathcal{L}\left(n_1,n_2,\ldots,n_K;\alpha, \beta\right)$.

\begin{equation} \mathcal{L}\left(n_1,n_2,\ldots,n_K;\alpha, \beta\right) = \ln\dfrac{N!}{\prod_{k=1}^{K}n_k!}+\beta\left[\sum_{k=1}^K n_k E_k - N\bar{E}\right]+\alpha\left[N-\sum_{k=1}^{K} n_k\right] \end{equation}

As we assumed $N\to \infty$, we can also assume $n_k \to \infty$ and use the Stirling approximation for the factorial

\begin{equation} \ln n! = n\ln n - n + \mathcal{O}(\ln n). \end{equation}

Note that this approximation (the first two terms) is only asymptotic; it does not mean that the approximation converges to $\ln n!$ for $n\to \infty$.

The partial derivative of the Lagrange function with respect to $n_{\tilde{k}}$ is then

$$\dfrac{\partial \mathcal{L}}{\partial n_{\tilde{k}}}=-\ln n_{\tilde{k}}-1-\alpha+\beta E_{\tilde{k}}.$$

If we set this partial derivative to zero we can find

$$n_{\tilde{k}}=\dfrac{\exp(\beta E_{\tilde{k}})}{\exp(1+\alpha)}. \qquad (***)$$

If we put this back into $(**)$ we can obtain

$$\exp(1+\alpha)=\dfrac{1}{N}\sum_{k=1}^K\exp(\beta E_k).$$

If we put this back into $(***)$ we get something that should remind us of the softmax function

$$n_{\tilde{k}}=\dfrac{\exp(\beta E_{\tilde{k}})}{\dfrac{1}{N}\sum_{k=1}^K\exp(\beta E_k)}.$$

If we define $n_{\tilde{k}}/N$ as the probability $p_{\tilde{k}}$ of class $\mathcal{C}_{\tilde{k}}$, we obtain something that is really similar to the softmax function

$$p_{\tilde{k}}=\dfrac{\exp(\beta E_{\tilde{k}})}{\sum_{k=1}^K\exp(\beta E_k)}.$$

Hence, this shows us that the softmax function is the function that maximizes the entropy of the distribution of images over the classes. From this point of view, it makes sense to use it as the class distribution. If we set $\beta E_{\tilde{k}}=\boldsymbol{w}_{\tilde{k}}^T\boldsymbol{x}$, we get exactly the definition of the softmax function for the $\tilde{k}^{\text{th}}$ output.
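As a small illustration (assuming NumPy, function name illustrative), the distribution derived above is just a softmax of the energies scaled by $\beta$; the larger $\beta$, the more the probability mass concentrates on the class with the largest energy:

```python
# p_k = exp(beta * E_k) / sum_j exp(beta * E_j), evaluated stably.
import numpy as np

def max_entropy_distribution(energies, beta):
    z = np.exp(beta * (energies - energies.max()))  # subtracting the max does not change the result
    return z / z.sum()

E = np.array([2.0, 4.0, 2.0, 1.0])
print(max_entropy_distribution(E, beta=1.0))  # the ordinary softmax of E
print(max_entropy_distribution(E, beta=5.0))  # almost all mass on the largest energy
```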

Answered by MachineLearner on February 24, 2021

Apart from the great answers mentioned here, one should also think about the shift-invariance of the softmax (or, for that matter, of any exponential function). Consider the logits $[a, b, c]$ output by a classifier network with 3 classes. The resulting probability distribution remains exactly the same if the logits are $[a+x, b+x, c+x]$.

For example, if we used $e^{e^x} - e$ for normalization (as was mentioned in one of the comments), we would lose this nice property. You may now ask why shift-invariance is desirable rather than, say, scale-invariance of the logits. I find shift-invariance an intuitively better design choice, but I do not know of a more theoretically backed reasoning for it, or whether one exists.
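A quick numerical check of this shift-invariance, assuming NumPy:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # the usual stable form; itself a consequence of shift-invariance
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(logits), softmax(logits + 10.0)))  # True: shift-invariant
print(np.allclose(softmax(logits), softmax(logits * 10.0)))  # False: not scale-invariant
```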

Secondly, as for using exponents other than $e$ in the softmax: this is actually done. Look into classification with a temperature, which is a common technique in machine learning. The softmax outputs may not correspond to well-calibrated probabilities, and temperature scaling is used to calibrate them (where the temperature may be learned or treated as a hyperparameter).
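A sketch of temperature scaling, assuming NumPy: dividing the logits by a temperature $T$ amounts to changing the base, since $e^{x/T} = (e^{1/T})^{x}$.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = (logits - logits.max()) / T   # subtract the max first for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 4.0, 2.0, 1.0])
print(softmax_with_temperature(logits, T=1.0))  # the standard softmax
print(softmax_with_temperature(logits, T=0.5))  # sharper, closer to the argmax
print(softmax_with_temperature(logits, T=5.0))  # softer, closer to uniform
```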

Answered by Naman on February 24, 2021

In addition to the above answers, the softmax should have the following properties:

  • Monotonically increasing: larger inputs must give larger outputs.
  • Non-negative outputs: probability values must be non-negative.
  • Outputs summing to one: probability values must sum to one.

Here the non-negativity property is ensured by the exponential function.
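A minimal check of these three properties, assuming NumPy:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([-3.0, 0.0, 2.5])   # increasing inputs
p = softmax(x)
assert np.all(p > 0)             # non-negative (in fact strictly positive) outputs
assert np.isclose(p.sum(), 1.0)  # outputs sum to one
assert np.all(np.diff(p) > 0)    # monotonically increasing with the inputs
```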

Answered by ibilgen on February 24, 2021
