In a predefined (but realistic and explanatory) example, what should $p(x)$ be in a variational autoencoder?

Question

I've been learning about variational autoencoders for a month now. The more I read and dig into it, the more confused I am.
Thinking about it from a neural network perspective probably doesn't help, and I can't find anyone asking the kind of questions I'm asking, so I'm definitely missing something.
In the tutorials/explanations, they say the initial goal is to calculate $p(z|x)$. The reason we have to set up the whole network is to approximate this. And the reason we don't have it directly is that, in $\frac{p(x|z)p(z)}{p(x)}$, $p(x)$ is intractable. This gives the impression that we know $p(x|z)$. We defined $p(z)$ as a standard normal distribution, that's clear. And even though we know that $p(x|z)$ is the output of decoder, if we look aside, or forget that we know this, it is not obvious at all. At least for me.
Are we looking at $x$, the input fed into the network as $p(x|z)$, and basically telling it "this is the probability of $x$ given $z$, figure out what's the probability of $z$ given $x$, if this is the probability of $z$"? But because it's missing $p(x)$, that's why we have to just approximate it? And this is the reason why we think about $p(x|z)$ as we know it?
How would it be possible to reproduce a "perfect" autoencoder, where we know $p(x)$, even if it's a very simple example? How should $p(x)$ be imagined in the context of variational autoencoders?

shimao · Accepted Answer

I think you might be mixing up three different things:

the definition of various distributions
how the densities can be computed
how they can be sampled

Some examples:
$p(x)$ is the distribution modeled by the VAE.

It's defined as $p(x) = \int p(x|z)p(z) \, dz$
In practice, if I gave you an $x$, you would have a hard time computing this integral, even approximately.
You can sample from $p(x)$ without too much trouble -- just sample $z \sim p(z)$, then sample $x \sim p(x|z)$
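To make the last two points concrete, here is a minimal sketch under a strong assumption that is not part of the VAE setup itself: a one-dimensional toy model whose decoder mean is linear, $f(z) = az + b$, with a fixed noise scale $s$. In that special case the integral happens to have a closed form, $p(x) = \mathcal{N}(x; b, a^2 + s^2)$, so a naive Monte Carlo estimate can be checked against it. The values of `a`, `b`, `s` and the helper names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy decoder: mean f(z) = a*z + b, fixed noise std s (assumed values).
a, b, s = 2.0, 1.0, 0.5

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def sample_x(n):
    """Ancestral sampling from p(x): draw z ~ p(z), then x ~ p(x|z)."""
    z = rng.standard_normal(n)
    return a * z + b + s * rng.standard_normal(n)

def mc_marginal(x, n=200_000):
    """Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)]."""
    z = rng.standard_normal(n)
    return normal_pdf(x, a * z + b, s).mean()

x_samples = sample_x(100_000)
x0 = 1.7
p_mc = mc_marginal(x0)
# Closed form, available only because the toy decoder is linear.
p_exact = normal_pdf(x0, b, np.sqrt(a**2 + s**2))

print("sample mean / variance:", x_samples.mean(), x_samples.var())
print("p(x0), Monte Carlo vs exact:", p_mc, p_exact)
```

With a neural-network $f$ and a high-dimensional $x$ there is no closed form to compare against, and most $z$ drawn from the prior contribute almost nothing to the average, which is why the naive estimator stops being practical.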

$p(x|z)$

is defined as a normal distribution with mean $f(z; \theta)$, where $f$ is some arbitrary function (maybe a neural network).
it's just a normal distribution, so it's trivial to compute the density
and also trivial to sample from, again it's just drawing from a normal distribution.
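As a rough illustration of those three points, the sketch below assumes a tiny made-up decoder (one `tanh` layer mapping a 2-D latent to a 3-D observation) and a fixed isotropic noise scale `sigma`. None of these choices come from a particular VAE; they only show that once $f(z; \theta)$ is evaluated, the density and the sampling step are ordinary Gaussian operations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny decoder parameters theta: 2-D latent -> 3-D observation.
theta = {"W": rng.standard_normal((3, 2)), "c": rng.standard_normal(3)}
sigma = 0.1  # assumed fixed observation noise scale

def f(z, theta):
    """Decoder output: the MEAN of p(x|z), not the distribution itself."""
    return np.tanh(theta["W"] @ z + theta["c"])

def log_p_x_given_z(x, z, theta):
    """Log-density of N(x; f(z; theta), sigma^2 I)."""
    d = x.shape[0]
    resid = x - f(z, theta)
    return -0.5 * (resid @ resid) / sigma**2 - 0.5 * d * np.log(2.0 * np.pi * sigma**2)

def sample_x_given_z(z, theta):
    """Draw x ~ p(x|z): decoder mean plus Gaussian noise."""
    return f(z, theta) + sigma * rng.standard_normal(3)

z = rng.standard_normal(2)
x = sample_x_given_z(z, theta)
print("log p(x|z):", log_p_x_given_z(x, z, theta))
```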

When you write that

And even though we know that $p(x|z)$ is the output of decoder

you're confusing $f(z; \theta)$ -- the output of the decoder, with the distribution.

In the tutorials/explanations, they say the initial goal is to
calculate $p(z|x)$

A better way to phrase this might be: remember when we said computing $p(x)$ was difficult? It turns out that it's really important to be able to compute $p(x)$ or $\log p(x)$ efficiently. And also, it turns out that $\log p(x) = E_{z \sim p(z|x)}[\log p(x|z)] - \mathcal{D}_{KL}( p(z|x) \,\|\, p(z) )$, so if we knew what $p(z|x)$ was, all our problems would be solved. Unfortunately, it's not practical to compute $p(z|x)$ either.
Using a normal distribution $q(z|x)$ to approximate $p(z|x)$ would make things much easier, since the first expectation can be approximated by Monte Carlo sampling, and the KL divergence term has a closed form. The key to why a VAE works at all is that you can prove replacing $p(z|x)$ with any approximation $q(z|x)$ will result in a lower bound on $\log p(x)$ -- you will never over-estimate $p(x)$, only under-estimate, which is crucial because we're trying to maximize $p(x)$. Without this property, there'd be no practical way to train a VAE.
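For reference, the lower-bound claim follows from a standard identity, written here in the same notation (the label ELBO, for evidence lower bound, is the usual name and is not used elsewhere in this answer):

$$
\log p(x) = \underbrace{E_{z \sim q(z|x)}[\log p(x|z)] - \mathcal{D}_{KL}( q(z|x) \,\|\, p(z) )}_{\text{ELBO}} + \mathcal{D}_{KL}( q(z|x) \,\|\, p(z|x) )
$$

The last KL term is non-negative, so the ELBO can never exceed $\log p(x)$, and the gap closes exactly when $q(z|x) = p(z|x)$, which recovers the equality above.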
To compare:
$p(z|x)$

defined as $\frac{p(x,z)}{p(x)}$
difficult to compute
difficult to sample from

$q(z|x)$

It's defined as a normal distribution with mean and diagonal covariance computed by a neural network as a function of $x$.
It's easy to compute the density of (and more importantly, the KL divergence from $q$ to another normal distribution).
It's easy to sample from, since it's normal.
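The comparison can be checked numerically in a case where everything is known. The sketch below again assumes a one-dimensional linear-Gaussian toy model, $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(az + b, s^2)$, chosen only because its exact $p(x)$ and $p(z|x)$ are available in closed form; a real VAE has neither. It evaluates the lower bound once with $q$ set to the exact posterior and once with a deliberately poor $q$. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: z ~ N(0, 1), x|z ~ N(a*z + b, s^2) (assumed values).
a, b, s = 2.0, 1.0, 0.5

def log_normal(x, mean, std):
    """Log-density of a univariate normal distribution."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) )."""
    return 0.5 * (mu**2 + sigma**2 - 1.0) - np.log(sigma)

def elbo(x, mu, sigma, n=200_000):
    """Monte Carlo estimate of E_{z~q}[log p(x|z)] minus the closed-form KL term."""
    z = mu + sigma * rng.standard_normal(n)  # samples from q(z|x) = N(mu, sigma^2)
    return log_normal(x, a * z + b, s).mean() - kl_to_standard_normal(mu, sigma)

x0 = 1.7
# Exact log p(x0); closed form exists only because the model is linear-Gaussian.
log_px = log_normal(x0, b, np.sqrt(a**2 + s**2))

# Exact posterior p(z|x0) for this toy model.
post_mu = a * (x0 - b) / (a**2 + s**2)
post_sigma = s / np.sqrt(a**2 + s**2)

elbo_exact_q = elbo(x0, post_mu, post_sigma)  # q equals the true posterior
elbo_poor_q = elbo(x0, 0.0, 1.0)              # q equals the prior: a bad fit

print("log p(x0):            ", log_px)
print("ELBO, q = posterior:  ", elbo_exact_q)
print("ELBO, q = prior:      ", elbo_poor_q)
```

Up to Monte Carlo noise, the first bound matches $\log p(x_0)$ and the second falls well below it, which is the behaviour the lower-bound argument predicts.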
