
Non-mutually exclusive classification sum of probabilities

Data Science Asked by dodo on March 12, 2021

So I have the following problem: I realized (while writing my master's thesis) that I am still unsure about, or only have vague descriptions of, some of the machine learning principles. I already asked one question regarding definitions that can be found here.

Now I have stumbled over another definition problem.
Here is an excerpt from my thesis (this is specifically about neural-network classification):

If the classes are mutually exclusive (i.e. if a sample $x^{j} = C_{0}$, then $x^{j} \neq C_{i}\setminus C_{0}$), the probabilities of all classes add up to one:
\begin{equation}
\sum_{i} P(x^{j}=C_{i}) = 1.
\end{equation}
In this case the best practice is to use a softmax activation function for the output neurons.
If the classes are not mutually exclusive, it suffices to use a sigmoid output activation function, as the sigmoid function gives independent probabilities for each class:
\begin{equation}
\sum_{i} P(x^{j}=C_{i}) \geq 1.
\end{equation}

I already found the following link regarding this topic.
However, I know that in practice, if you don't use a softmax activation function in your output layer, the value can be larger than 1. But can a probability be larger than 1? Isn't that against its definition?
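
As a quick sanity check of what I mean, here is a minimal sketch with made-up numbers, assuming a plain linear output layer:

```python
import numpy as np

# Hypothetical raw outputs (logits) of a network whose output layer has no activation.
logits = np.array([2.3, -0.4, 1.1])
print(logits.sum())  # 3.0 -- clearly not a probability

# Even with an element-wise sigmoid each value lies in (0, 1),
# but the values are independent and their sum can still exceed 1.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
print(sigmoid.sum())  # roughly 2.06
```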

Is non-mutual classification really a common case? Can somebody maybe link some cases (papers?) where they needed non-mutual classification?

3 Answers

You are correct: a probability cannot be larger than 1.

At the final layer, the activations (also known as logits) are passed through a final softmax function in order to fulfil this constraint. The standard neural network does not have an implicit mechanism by which it can ensure that this constraint is met during training.
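
As a rough sketch of that step (not the code of any particular framework, just the standard formula):

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability; the result is unchanged.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([2.0, 0.5, -1.0])  # hypothetical final-layer activations
probs = softmax(logits)
print(probs.sum())  # 1.0 -- the constraint is enforced by construction
```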


By mutually exclusive classification, we could be talking about something like classifying cat and dog images, in which case the label for each image is either cat or dog, so the classes are mutually exclusive. This is a very common case - almost any form of image classification falls into this category.

You do not use a sigmoid function (or any other non-linearity, for that matter) after the final layer, as there are no further neurons following it, making a non-linearity somewhat redundant. Using a non-linearity for the purpose of fitting a non-linear model is different from the purpose of a final softmax function, which is exactly to scale the final logits/activations into the nice range of [0, 1] so that they can be interpreted as probabilities. That allows us to make simple rules on how to classify the outputs - e.g. if p = [0.51, 0.49] then that sample was a cat, whereas p = [0.49, 0.51] is a dog.
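
A tiny sketch of that decision rule (the probabilities are just the made-up values from above):

```python
classes = ["cat", "dog"]

for p in ([0.51, 0.49], [0.49, 0.51]):
    # With mutually exclusive classes we simply pick the most probable one.
    predicted = classes[p.index(max(p))]
    print(p, "->", predicted)  # [0.51, 0.49] -> cat, [0.49, 0.51] -> dog
```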

I used those values in the example to highlight a further point, namely that you cannot interpret them as pure probabilities. Those examples don't mean the model was really unsure in both cases just because all four "probabilities" were close to 0.5. The model gives more weight to the option it believes is correct - the relative magnitudes of those values are not directly interpretable.

Answered by n1k31t4 on March 12, 2021

Just from the probabilistic side of things: When the classes are not mutually exclusive, the events $x^j=C_i$ are not disjoint, so in general (if every example gets some label),

$$1 = P(x^j=C_0 \vee \dotsb \vee x^j=C_t) \lneq \sum_i P(x^j=C_i).$$

That is, the answer to

the value can be larger than 1 but can a probability be larger than 1?

is "the value (of the sum) is not a probability, it's a sum of probabilities."


Is non-mutual classification really a common case? Can somebody maybe link some cases (papers?) where they needed non-mutual classification?

This is commonly known as "multi-label classification." Examples include topics/genres/themes, where a given item may include more than one. I'll defer to the question you linked to provide other links.


Note that the sum being at least 1 above relied on every item receiving at least one label. In cases where that doesn't hold, the sum may be less than 1. And the outputs of a neural network (when individually sigmoid-activated) may sum to less than 1 even when every item receives a label, due to imperfect calibration.


Finally, I'd just like to say that the notation $x^j=C_0,\ x^j\neq C_i\setminus C_0$ seems off to me; using $\in$ and $\notin$ seems better.

Answered by Ben Reiniger on March 12, 2021

the value can be larger than 1 but can a probability be larger than 1? Isn't that against its definition?

Speaking very simply about how a model (NN) works: it doesn't know whether it is producing a probability or just a number. It only knows that it has to minimise the loss to match the output.

I see no reason why an output can't become > 1 if we don't use sigmoid/softmax and provide the model Y = [1, 0, 0, 1, 1] (for 5 data points).
It can end up near 1 or slightly negative, e.g. 1.05, -0.05, etc.
We use sigmoid or softmax to convert this value to a class. Softmax has the additional property of magnifying bigger values and stretching the differences in the input.
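
A small sketch with made-up numbers (sigmoid only squashes each raw output into (0, 1); it does not force the outputs to sum to 1):

```python
import numpy as np

raw = np.array([1.05, -0.05])           # raw outputs: slightly above 1 / slightly negative
squashed = 1.0 / (1.0 + np.exp(-raw))   # sigmoid maps each into (0, 1)
labels = (squashed > 0.5).astype(int)   # threshold each output independently
print(squashed, labels)                 # approx. [0.74 0.49] -> [1 0]
```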

Is non-mutual classification really a common case?

It's the case of multi-label classification.
In a 10-class problem, an image containing [Cat, Dog, Flower] will have Y = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0].
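
As a sketch of how such a target is typically handled (one independent sigmoid output per class and a per-class threshold; the class names beyond Cat, Dog and Flower are made up):

```python
import numpy as np

class_names = ["Cat", "Bird", "Car", "Dog", "Flower",
               "Tree", "Boat", "Fish", "House", "Horse"]  # hypothetical 10 classes
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0, 0, 0])          # Cat, Dog and Flower present

# Hypothetical sigmoid outputs of a multi-label network for this image.
y_prob = np.array([0.92, 0.08, 0.03, 0.81, 0.66, 0.10, 0.05, 0.07, 0.02, 0.11])

# Each class is decided independently; the outputs need not sum to 1.
y_pred = (y_prob > 0.5).astype(int)
print([name for name, flag in zip(class_names, y_pred) if flag])  # ['Cat', 'Dog', 'Flower']
```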

I hope I have not over-simplified things.

Answered by 10xAI on March 12, 2021
