
Why does training a Restricted Boltzmann Machine correspond to having a good reconstruction of the training data?

Data Science · Asked by LtChang on June 9, 2021

Many tutorials suggest that after training an RBM, one can obtain a good reconstruction of the training data, just as with an autoencoder.

An example tutorial.

But the training process of an RBM essentially maximizes the likelihood of the training data. Since we usually use a technique like CD-k or PCD, it seems we can only say that a trained RBM is likely to generate data that resembles the training data (digits, if we use MNIST), not that it reconstructs its input well. Or are these two things equivalent in some way?

Hinton said that it is not a good idea to use reconstruction error to monitor the progress of training, which is why I have this question.

One Answer

1) I believe this comes from a general property of Maximum Likelihood estimation, namely, that it is equivalent to minimizing the KL-divergence between the model distribution and the (empirical) data distribution.

Proof: Suppose we have data $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ drawn independently from an unknown data-generating process. To approximate its distribution, we consider a family of distributions $p_{\text{model}}(\,\cdot\,;\theta)$ parameterized by $\theta$. Then
$$
\theta_{MLE}
= \arg\max_{\theta} p_{\text{model}}(X;\theta)
= \arg\max_{\theta} \prod_{n=1}^N p_{\text{model}}(\mathbf{x}_n;\theta)
= \arg\max_{\theta} \sum_{n=1}^N \log p_{\text{model}}(\mathbf{x}_n;\theta)
= \arg\max_{\theta} \underbrace{\left(\frac{1}{N}\sum_{n=1}^N \log p_{\text{model}}(\mathbf{x}_n;\theta)\right)}_{\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}(\mathbf{x})}\left[\log p_{\text{model}}(\mathbf{x};\theta)\right]},
$$
where $p_{\text{data}}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^N \delta_{\mathbf{x}-\mathbf{x}_n}$ is the (empirical) data distribution. Now, by definition,
$$
D_{KL}\left(p_{\text{data}} \,\|\, p_{\text{model}}(\,\cdot\,;\theta)\right)
= \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}(\mathbf{x})}\left[\log p_{\text{data}}(\mathbf{x}) - \log p_{\text{model}}(\mathbf{x};\theta)\right].
$$
Observe that the first term is independent of $\theta$; thus minimizing the KL-divergence is equivalent to minimizing only the second term, which is exactly the negative of the quantity Maximum Likelihood maximizes. $\blacksquare$
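To see the equivalence numerically, here is a minimal sketch (my own illustration, not part of the original proof) for a Bernoulli model: the average log-likelihood and the negative KL-divergence differ only by the entropy of $p_{\text{data}}$, which is constant in $\theta$, so both are maximized by the same $\theta$.

```python
import numpy as np

# Toy check of the claim above: for a Bernoulli model p_model(x; theta),
# argmax_theta E_{p_data}[log p_model] equals argmin_theta D_KL(p_data || p_model),
# because the two objectives differ only by the constant entropy of p_data.

x = np.array([1, 1, 1, 0, 0])                           # toy binary dataset
p_data = np.array([np.mean(x == 0), np.mean(x == 1)])   # empirical distribution

thetas = np.linspace(0.01, 0.99, 99)
avg_ll = []   # E_{p_data}[log p_model(x; theta)]
neg_kl = []   # -D_KL(p_data || p_model(.; theta))
for theta in thetas:
    p_model = np.array([1.0 - theta, theta])
    avg_ll.append(np.sum(p_data * np.log(p_model)))
    neg_kl.append(-np.sum(p_data * np.log(p_data / p_model)))

# Both objectives peak at the same parameter value (theta = 0.6 here,
# the empirical mean of x).
print(thetas[np.argmax(avg_ll)], thetas[np.argmax(neg_kl)])
```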

Now, if the training data have high probability under the RBM (equivalently, the RBM's model distribution is close to the empirical data distribution), then, recalling that the RBM is trained with an MCMC-based algorithm (CD-k/PCD-k), and assuming the Markov chain particles are close to their stationary distribution, feeding a training example into the RBM shouldn't change it much after reconstruction. Low reconstruction error is therefore expected in this case, and this is what typically happens in practice.
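To make "reconstruction" concrete, here is a minimal numpy sketch of one up-down Gibbs pass (the names `W`, `b_v`, `b_h` are my own placeholders for a trained RBM's weights and visible/hidden biases, not from the answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(v, W, b_v, b_h, rng):
    """One up-down Gibbs pass v -> h -> v', as used in CD-1."""
    h_prob = sigmoid(v @ W + b_h)                   # p(h = 1 | v)
    h = (rng.random(h_prob.shape) < h_prob) * 1.0   # sample hidden states
    v_prob = sigmoid(h @ W.T + b_v)                 # p(v = 1 | h)
    return v_prob                                   # mean-field reconstruction

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))              # toy, untrained parameters
b_v, b_h = np.zeros(6), np.zeros(4)
v = rng.integers(0, 2, size=6).astype(float)

v_recon = reconstruct(v, W, b_v, b_h, rng)
print("reconstruction error:", np.mean((v - v_recon) ** 2))
```

With parameters fitted so that the chain mixes near the data distribution, `v_recon` would stay close to `v`, which is exactly the low reconstruction error described above.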

2) But as you correctly pointed out, even though it is convenient to measure reconstruction error during training, it is a poor measure of what is actually going on inside the RBM during learning, since this is not the function the RBM aims to optimize (the RBM maximizes the probability of the visible units $p(\mathbf{v})$ over all training data, which is equivalent to minimizing the corresponding KL-divergence).

Unfortunately, measuring the (log-)likelihood directly is intractable in this model, but there exist better proxies to the true likelihood, such as the pseudo-likelihood (see, e.g., Hinton's practical guide that you mentioned).
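For completeness, here is a sketch of how the pseudo-likelihood can be computed for a binary RBM (illustrative code of mine, not from the answer): for each visible unit $i$, compare the free energy of $\mathbf{v}$ against $\mathbf{v}$ with bit $i$ flipped, which avoids the intractable partition function:

```python
import numpy as np

def free_energy(v, W, b_v, b_h):
    """F(v) = -v.b_v - sum_j softplus(v.W_j + b_h_j); p(v) is proportional to exp(-F(v))."""
    return -v @ b_v - np.sum(np.logaddexp(0.0, v @ W + b_h))

def log_pseudo_likelihood(v, W, b_v, b_h):
    """sum_i log p(v_i | v_{-i}) = sum_i log sigmoid(F(v_flip_i) - F(v))."""
    f_v = free_energy(v, W, b_v, b_h)
    total = 0.0
    for i in range(v.size):
        v_flip = v.copy()
        v_flip[i] = 1.0 - v_flip[i]                   # flip visible bit i
        f_flip = free_energy(v_flip, W, b_v, b_h)
        total += -np.logaddexp(0.0, -(f_flip - f_v))  # stable log sigmoid
    return total

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))                # toy parameters
b_v, b_h = np.zeros(6), np.zeros(4)
v = rng.integers(0, 2, size=6).astype(float)
print("log pseudo-likelihood:", log_pseudo_likelihood(v, W, b_v, b_h))
```

scikit-learn's `BernoulliRBM.score_samples`, for instance, estimates a stochastic version of this quantity by flipping a single randomly chosen bit per sample.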

Correct answer by yell on June 9, 2021
