
Calculating KL Divergence in Python

Data Science Asked by Nanda on May 22, 2021

I am rather new to this and can’t say I have a complete understanding of the theoretical concepts behind it. I am trying to calculate the KL divergence between several lists of points in Python, using sklearn.metrics.mutual_info_score. The problem I’m running into is that the value returned is the same for any 2 lists of numbers (it’s 1.3862943611198906). I have a feeling that I’m making some sort of theoretical mistake here but can’t spot it.

from sklearn import metrics

values1 = [1.346112,1.337432,1.246655]
values2 = [1.033836,1.082015,1.117323]
metrics.mutual_info_score(values1,values2)

That is an example of what I’m running – just that I’m getting the same output for any 2 inputs. Any advice/help would be appreciated!

6 Answers

First of all, sklearn.metrics.mutual_info_score implements mutual information for evaluating clustering results, not pure Kullback-Leibler divergence!

This is equal to the Kullback-Leibler divergence of the joint distribution with the product distribution of the marginals.

KL divergence (and any other such measure) expects the input data to sum to 1. Otherwise, they are not proper probability distributions. If your data does not sum to 1, it is most likely not proper to use KL divergence! (In some cases, it may be admissible to have a sum of less than 1, e.g. in the case of missing data.)

Also note that it is common to use base 2 logarithms. This only changes the result by a constant scaling factor, but base 2 logarithms are easier to interpret and have a more intuitive scale (0 to 1 instead of 0 to log 2 = 0.69314..., measuring the information in bits instead of nats).

> sklearn.metrics.mutual_info_score([0,1],[1,0])
0.69314718055994529

As we can clearly see, the MI result of sklearn is scaled using natural logarithms instead of base 2. This is an unfortunate choice, as explained above.

Kullback-Leibler divergence is fragile, unfortunately. On the above example it is not well-defined: KL([0,1],[1,0]) causes a division by zero and tends to infinity. It is also asymmetric.
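For example, a minimal sketch of normalizing the question's lists and then computing the KL divergence in bits with scipy.stats.entropy could look like this:

import numpy as np
from scipy.stats import entropy

# Lists from the question; they do not sum to 1, so normalize them first.
values1 = np.asarray([1.346112, 1.337432, 1.246655])
values2 = np.asarray([1.033836, 1.082015, 1.117323])

p = values1 / values1.sum()
q = values2 / values2.sum()

# entropy(p, q) computes sum(p * log(p / q)); base=2 reports the result in bits.
print(entropy(p, q, base=2))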

Correct answer by Has QUIT--Anony-Mousse on May 22, 2021

I'm not sure with the scikit-learn implementation, but here is a quick implementation of the KL divergence in Python:

import numpy as np

def KL(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    return np.sum(np.where(a != 0, a * np.log(a / b), 0))


values1 = [1.346112,1.337432,1.246655]
values2 = [1.033836,1.082015,1.117323]

print(KL(values1, values2))

Output: 0.775279624079

Implementations may differ between libraries, so make sure you read their docs before using them.

Answered by Dawny33 on May 22, 2021

Scipy's entropy function will calculate the KL divergence if fed two vectors p and q, each representing a probability distribution. If the two vectors aren't pdfs, it will normalize them first.

Mutual information is related to, but not the same as KL Divergence.

"This weighted mutual information is a form of weighted KL-Divergence, which is known to take negative values for some inputs, and there are examples where the weighted mutual information also takes negative values"

Answered by jamesmf on May 22, 2021

This trick avoids conditional code and may therefore provide better performance.

import numpy as np

def KL(P, Q):
    """ Epsilon is used here to avoid conditional code for
    checking that neither P nor Q is equal to 0. """
    epsilon = 0.00001

    # P + epsilon creates new arrays, so the original inputs are not modified.
    P = P + epsilon
    Q = Q + epsilon

    divergence = np.sum(P * np.log(P / Q))
    return divergence

# Should be normalized though
values1 = np.asarray([1.346112,1.337432,1.246655])
values2 = np.asarray([1.033836,1.082015,1.117323])

# Note slight difference in the final result compared to Dawny33
print(KL(values1, values2)) # 0.775278939433

Answered by Johann on May 22, 2021

Consider the following three samples from a distribution (or distributions).

values1 = np.asarray([1.3,1.3,1.2])
values2 = np.asarray([1.0,1.1,1.1])
values3 = np.array([1.8,0.7,1.7])

Clearly, values1 and values2 are closer, so we expect the measure of surprise or entropy, to be lower when compared to values3.

from scipy.stats import entropy
print("\nIndividual Entropy\n")
print(entropy(values1))
print(entropy(values2))
print(entropy(values3))

print("\nPairwise Kullback Leibler divergence\n")
print(entropy(values1, qk=values2))
print(entropy(values1, qk=values3))
print(entropy(values2, qk=values3))

We see the following output:

Individual Entropy

1.097913446793334
1.0976250611902076
1.0278436769863724 #<--- this one had the lowest, but doesn't mean much.

Pairwise Kullback Leibler divergence

0.002533297351606588
0.09053972625203921 #<-- makes sense
0.09397968199352116 #<-- makes sense

We see this makes sense because the differences between values1 and values3, and between values2 and values3, are simply more drastic than the difference between values1 and values2. This is my validation for understanding KL divergence and the packages that can be leveraged for it.

Answered by bmc on May 22, 2021

Kullback-Leibler divergence is basically the sum of the element-wise relative entropy of two probability distributions:

import numpy as np
import scipy.special

vec = scipy.special.rel_entr(p, q)
kl_div = np.sum(vec)

As mentioned before, just make sure p and q are probability distributions (they sum to 1). You can always normalize them beforehand:

p /= np.sum(p)
q /= np.sum(q)

Relative entropy is defined as p*log(p/q), so where q==0, the result is inf. You can mask those values using:

vec = np.ma.masked_invalid(vec).compressed()
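Putting these pieces together, a minimal sketch on the question's lists could look like this:

import numpy as np
from scipy.special import rel_entr

p = np.asarray([1.346112, 1.337432, 1.246655])
q = np.asarray([1.033836, 1.082015, 1.117323])

# Normalize so both are proper probability distributions.
p = p / np.sum(p)
q = q / np.sum(q)

# Element-wise relative entropy p * log(p / q); entries are inf where q == 0 and p > 0.
vec = rel_entr(p, q)

# Drop any inf values before summing.
vec = np.ma.masked_invalid(vec).compressed()
kl_div = np.sum(vec)
print(kl_div)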

Answered by Noam Peled on May 22, 2021
