
Training set Distribution and Activation function/Loss function correlation

Data Science · Asked by Turned Capacitor on July 5, 2021

How should the probability distribution of the training set influence the choice of the activation function / loss function?
For instance if I have a Multinoulli distribution, which activation function should I choose? And why?
I don’t understand the connection between the probability distribution of the training set and the choice of the activation function / loss function.

One Answer

The probability distribution of the training set normally has nothing to do with the activation function or loss function. Instead, the activation function of the last layer and the loss function are determined directly by what you are trying to predict.

For instance:

  • If you have a regression problem where the output values are not bounded, you would probably use no activation in the last layer and mean squared error as the loss function (see the first sketch after this list).
  • If you want your network to perform binary classification, you would use a sigmoid activation in the last layer (which outputs a value between 0 and 1, interpreted as the probability of belonging to one class or the other), with binary cross-entropy as the loss function.
  • If you want your network to select among N elements (i.e. multiclass classification), you would probably want it to predict a probability distribution over those N discrete elements. This is precisely a categorical/multinoulli distribution over a discrete output space (only one element is selected from N possible discrete alternatives). In this case, the activation of the last layer should be a softmax, and the loss should be the categorical cross-entropy, also referred to as the negative log-likelihood. Be careful, however, with framework-specific details, because this pair of activation and loss is often implemented with numerical-stability optimizations, and you need to select the matching implementations. For instance, in PyTorch, when you use NLLLoss, the last-layer activation must be LogSoftmax, not a plain Softmax. Alternatively, you can use CrossEntropyLoss, which combines NLLLoss and LogSoftmax into a single class (see the second sketch after this list). These aspects are normally described in the framework documentation.
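To make the pairings concrete, here is a minimal PyTorch sketch of the three last-layer/loss combinations above (the batch size, feature dimension, and number of classes are made up for illustration):

    import torch
    import torch.nn as nn

    x = torch.randn(8, 16)             # a batch of 8 inputs with 16 features

    # 1. Unbounded regression: no activation in the last layer, MSE loss.
    reg_head = nn.Linear(16, 1)
    y_reg = torch.randn(8, 1)
    loss_reg = nn.MSELoss()(reg_head(x), y_reg)

    # 2. Binary classification: sigmoid activation, binary cross-entropy.
    bin_head = nn.Linear(16, 1)
    y_bin = torch.randint(0, 2, (8, 1)).float()
    loss_bin = nn.BCELoss()(torch.sigmoid(bin_head(x)), y_bin)

    # 3. Multiclass classification over N=5 classes: softmax + categorical
    #    cross-entropy (here via CrossEntropyLoss, which takes raw logits).
    cls_head = nn.Linear(16, 5)
    y_cls = torch.randint(0, 5, (8,))
    loss_cls = nn.CrossEntropyLoss()(cls_head(x), y_cls)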
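And here is a small sketch of the numerical-stability point: NLLLoss expects log-probabilities (the output of LogSoftmax), while CrossEntropyLoss takes raw logits and applies LogSoftmax internally, so the two formulations yield the same loss:

    import torch
    import torch.nn as nn

    logits = torch.randn(4, 3)                # batch of 4, N=3 classes (made up)
    targets = torch.tensor([0, 2, 1, 2])

    # Option A: explicit LogSoftmax in the last layer, then NLLLoss.
    log_probs = nn.LogSoftmax(dim=1)(logits)
    loss_a = nn.NLLLoss()(log_probs, targets)

    # Option B: fused, numerically stable CrossEntropyLoss on raw logits.
    loss_b = nn.CrossEntropyLoss()(logits, targets)

    print(torch.allclose(loss_a, loss_b))     # True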

Correct answer by noe on July 5, 2021
