# In a predefined (but realistic, and explaining) example, what $p(x)$ should be in a variational autoencoder?

Cross Validated Asked by Gergő Horváth on February 19, 2021

I’ve been learning about variational autoencoders for a month now. The more I read and dig into it, the more confused I become.

Thinking about it from a neural network perspective probably doesn’t help, and I can’t find the kind of questions I’m asking, so I’m definitely missing something.

In the tutorials/explanations, they say the initial goal is to calculate $$p(z|x)$$. The reason we have to set up the whole network is to approximate this. And the reason we don’t have it directly is that in $$\frac{p(x|z)p(z)}{p(x)}$$, $$p(x)$$ is intractable. This gives the impression that we know $$p(x|z)$$. We defined $$p(z)$$ as a standard normal distribution, that’s clear. And even though we know that $$p(x|z)$$ is the output of the decoder, if we look aside, or forget that we know this, it is not obvious at all. At least for me.

Are we looking at $$x$$, the input fed into the network, as $$p(x|z)$$, basically telling it "this is the probability of $$x$$ given $$z$$; figure out the probability of $$z$$ given $$x$$, if this is the probability of $$z$$"? But because $$p(x)$$ is missing, we have to settle for an approximation? And is this the reason we treat $$p(x|z)$$ as known?

How would it be possible to construct a "perfect" autoencoder, where we know $$p(x)$$, even in a very simple example? How should $$p(x)$$ be imagined in the context of variational autoencoders?

I think you might be mixing up three things:

1. the definition of various distributions
2. how the densities can be computed
3. how they can be sampled

Some examples:

$$p(x)$$ is the distribution modeled by the VAE.

1. It's defined as $$p(x) = \int p(x|z)\,p(z)\,dz$$
2. In practice, if I gave you an $$x$$, you would have a hard time computing this integral, even approximately.
3. You can sample from $$p(x)$$ without too much trouble -- just sample $$z \sim p(z)$$, then sample $$x \sim p(x|z)$$
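Point 3 is ancestral sampling, which can be sketched in a few lines of NumPy. Here `f` is a hypothetical stand-in for a trained decoder $$f(z; \theta)$$, and `obs_std` an assumed fixed observation noise -- neither comes from the answer itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    # Hypothetical decoder: a fixed nonlinear map standing in for
    # a trained neural network f(z; theta).
    return np.tanh(z)

def sample_px(n, latent_dim=2, obs_std=0.1):
    """Ancestral sampling from p(x): first z ~ p(z), then x ~ p(x|z)."""
    z = rng.standard_normal((n, latent_dim))              # z ~ N(0, I) = p(z)
    mean = f(z)                                           # decoder output f(z; theta)
    x = mean + obs_std * rng.standard_normal(mean.shape)  # x ~ N(f(z), sigma^2 I)
    return x

samples = sample_px(1000)
```

Note that nothing here ever evaluates the density $$p(x)$$ -- sampling is easy precisely because it never requires the intractable integral.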

$$p(x|z)$$

1. is defined as a normal distribution with mean $$f(z; \theta)$$, where $$f$$ is some arbitrary function (maybe a neural network).
2. it's just a normal distribution, so it's trivial to compute the density
3. and also trivial to sample from, again it's just drawing from a normal distribution.
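Point 2 can be made concrete: since $$p(x|z)$$ is a (diagonal) Gaussian centered at the decoder output, its log-density is one line of arithmetic. A minimal sketch, where `f` again stands in for the decoder and `obs_std` is an assumed fixed noise scale:

```python
import numpy as np

def log_p_x_given_z(x, z, f, obs_std=0.1):
    """Log-density of p(x|z) = N(x; f(z), sigma^2 I).

    Just the diagonal-Gaussian log-density formula, evaluated per sample.
    """
    mean = f(z)                                # decoder mean f(z; theta)
    d = x.shape[-1]                            # dimensionality of x
    sq = np.sum((x - mean) ** 2, axis=-1)      # squared distance to the mean
    return -0.5 * (sq / obs_std**2 + d * np.log(2 * np.pi * obs_std**2))

x = np.array([[0.1, -0.2]])
z = np.array([[0.0, 0.0]])
val = log_p_x_given_z(x, z, np.tanh)
```

Contrast this with $$p(x)$$ above: conditioning on $$z$$ is what makes the density trivial, because the integral over $$z$$ disappears.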

When you write that

> And even though we know that $$p(x|z)$$ is the output of the decoder

you're confusing $$f(z;\theta)$$ -- the output of the decoder -- with the distribution.

> In the tutorials/explanations, they say the initial goal is to calculate $$p(z|x)$$

A better way to phrase this might be: remember when we said computing $$p(x)$$ was difficult? It turns out that it's really important to be able to compute $$p(x)$$ or $$\log p(x)$$ efficiently. And also, it turns out that $$\log p(x) = E_{z \sim p(z|x)}[\log p(x|z)] - \mathcal{D}_{KL}(p(z|x) \| p(z))$$, so if we knew what $$p(z|x)$$ was, all our problems would be solved. Unfortunately, it's not practical to compute $$p(z|x)$$ either.

Using a normal distribution $$q(z|x)$$ to approximate $$p(z|x)$$ makes things much easier: the first expectation can be approximated by Monte Carlo sampling, and the KL divergence term has a closed form. The key to why a VAE works at all is that you can prove that replacing $$p(z|x)$$ with any approximation $$q(z|x)$$ results in a lower bound on $$\log p(x)$$ -- you will never over-estimate $$\log p(x)$$, only under-estimate it, which is crucial because we're trying to maximize it. Without this property, there'd be no practical way to train a VAE.
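Both tractability claims can be checked numerically. The sketch below (function names are mine, not from the answer) computes the standard closed-form KL between a diagonal Gaussian $$q$$ and the prior $$N(0, I)$$, and confirms it against a Monte Carlo estimate of the same quantity:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_closed_form(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form:
    -0.5 * sum(1 + log_var - mu^2 - exp(log_var))."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

def kl_monte_carlo(mu, log_var, n=200_000):
    """Estimate the same KL by sampling z ~ q and averaging log q(z) - log p(z)."""
    std = np.exp(0.5 * log_var)
    z = mu + std * rng.standard_normal((n, mu.size))
    log_q = np.sum(-0.5 * (((z - mu) / std) ** 2 + log_var + np.log(2 * np.pi)), axis=1)
    log_p = np.sum(-0.5 * (z**2 + np.log(2 * np.pi)), axis=1)
    return np.mean(log_q - log_p)

mu = np.array([0.5, -1.0])
log_var = np.array([0.2, -0.3])
exact = kl_closed_form(mu, log_var)
estimate = kl_monte_carlo(mu, log_var)
```

The closed form is what makes the KL term of the objective cheap to evaluate exactly, while the expectation term, which has no closed form, is handled by exactly this kind of Monte Carlo sampling.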

To compare:

$$p(z|x)$$

1. defined as $$\frac{p(x,z)}{p(x)}$$
2. difficult to compute
3. difficult to sample from

$$q(z|x)$$

1. It's defined as a normal distribution with mean and diagonal covariance computed by a neural network as a function of $$x$$.
2. It's easy to compute the density of (and more importantly, the KL divergence from $$q$$ to another normal distribution).
3. It's easy to sample from, since it's normal.
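Point 3 can be sketched with the reparameterization $$z = \mu + \sigma \epsilon$$, $$\epsilon \sim N(0, I)$$, which is also how VAEs sample from $$q$$ in practice. The `encoder` here is a hypothetical stand-in for a trained network, not part of the answer:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x):
    # Hypothetical encoder standing in for a trained network: it maps x
    # to the mean and log-variance of the diagonal Gaussian q(z|x).
    mu = 0.5 * x
    log_var = np.full_like(x, -1.0)
    return mu, log_var

def sample_q(x, n):
    """Sample z ~ q(z|x) via the reparameterization z = mu + sigma * eps."""
    mu, log_var = encoder(x)
    eps = rng.standard_normal((n, mu.size))     # eps ~ N(0, I)
    return mu + np.exp(0.5 * log_var) * eps     # shift and scale

x = np.array([1.0, -2.0])
z = sample_q(x, 10_000)
```

Writing the sample as a deterministic function of $$x$$ and $$\epsilon$$ is also what lets gradients flow through the sampling step during training.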

Correct answer by shimao on February 19, 2021