
What is an autoencoder?

Data Science Asked on August 23, 2020

I am a student and I am studying machine learning. I am focusing on deep generative models, and in particular to autoencoders and variational autoencoders (VAE).

I am trying to understand the concept, but I am having some problems.

So far, I have understood that an autoencoder takes an input, for example an image, and reduces it to a latent space, which should contain the underlying features of the dataset, with an encoding operation; then, with a decoding operation, it reconstructs the image, which has lost some information due to the encoding part.

After this, a loss function measures how well the image was reconstructed, and by minimising it the network learns the latent features.


As for the VAE, it uses a probabilistic approach, so we have to learn the mean and covariance of a Gaussian.

So far this is what I have understood.

What is really unclear to me is: what are we trying to learn with autoencoders and VAEs?

I have seen examples where an image goes from a non-smiling to a smiling face, or from a black and white image to a colored image.

But I don’t understand the main concept, which is: what does an autoencoder do?

I add here some sources I studied from, so that anyone who needs them can see them:

  1. https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694
  2. https://www.youtube.com/watch?v=yFBFl1cLYx8
  3. https://www.youtube.com/watch?v=9zKuYvjFFS8

3 Answers

what does an auto-encoder do?

The simplest auto-encoder takes a high-dimensional image (say, 100K pixels) down to a low-dimensional representation (say, a vector of length 10) and then uses only those 10 features to try to reconstruct the original image. You can imagine an analogy with humans: I look at someone, describe them ("tall, dark-haired, ...") then after I've forgotten what they look like, I try to sketch them using only my notes.
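To make that concrete, here is a minimal sketch of such an auto-encoder in PyTorch. The sizes (784 input pixels, 10 latent features) and layer widths are illustrative assumptions, not anything prescribed by the analogy above:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        """Compress an image to a tiny code, then reconstruct it from that code."""
        def __init__(self, n_pixels=784, n_latent=10):
            super().__init__()
            # Encoder: high-dimensional image -> short "description" vector
            self.encoder = nn.Sequential(
                nn.Linear(n_pixels, 128), nn.ReLU(),
                nn.Linear(128, n_latent),
            )
            # Decoder: short description -> reconstructed image
            self.decoder = nn.Sequential(
                nn.Linear(n_latent, 128), nn.ReLU(),
                nn.Linear(128, n_pixels), nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)     # take the "notes" about the image
            return self.decoder(z)  # sketch the image back from the notes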

what are we trying to learn?

In other words, why bother? A few reasons:

  • dimensionality reduction: 10 features are a lot more convenient than 100K pixels. For example, I can perform classification by clustering in the 10-dimensional space (while clustering in the 100K-dimensional space would be intractable).
  • semantic meaning: if all goes well, each of the 10 features will have some obvious "explanation" -- e.g., tweaking one value will make the subject look older (though it's normally not so simple). As opposed to pixel values, which are impacted by translation, rotation, etc.
  • Anomaly detection: if I train my auto-encoder on dogs, it should normally do a good job encoding and decoding pictures of dogs. But if I put a cat in, it will probably do a terrible job -- which I can tell because the output looks nothing like the input. So, looking for places where an auto-encoder does a bad job is a common way to look for anomalies (see the sketch after this list).
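A sketch of that anomaly-detection idea, assuming a trained model like the one above; the threshold is a made-up placeholder that you would in practice calibrate on held-out "normal" data:

    import torch

    def reconstruction_error(model, x):
        # Per-example mean squared error between input and reconstruction.
        with torch.no_grad():
            x_hat = model(x)
        return ((x - x_hat) ** 2).mean(dim=1)

    def is_anomaly(model, x, threshold=0.05):
        # Inputs the auto-encoder reconstructs badly are flagged as anomalous.
        return reconstruction_error(model, x) > threshold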

I have seen examples where an image goes from a non-smiling to a smiling face, or from a black and white image to a colored image.

There are many different types of auto-encoders. What I described above is the simplest kind. Another common type is a "denoising" auto-encoder -- instead of reconstructing the original image, the goal is to construct an image that is related to the original image, but different.

The classic example of this is denoising (hence the name): you can take a clean image, add a bunch of noise, run it through an auto-encoder, and then reward the auto-encoder for producing the clean image. So, the input (noisy image) is actually different from the desired output (clean image). The examples you give are similar.
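A sketch of one denoising training step, under the same assumptions as the earlier snippet (Gaussian pixel noise is just one common choice of corruption):

    import torch
    import torch.nn.functional as F

    def denoising_step(model, optimizer, clean_batch, noise_std=0.3):
        # Corrupt the input...
        noisy_batch = clean_batch + noise_std * torch.randn_like(clean_batch)
        reconstruction = model(noisy_batch)
        # ...but reward the model for recovering the *clean* image:
        # the input really is different from the desired output.
        loss = F.mse_loss(reconstruction, clean_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()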

The challenge in designing these types of auto-encoders is normally the loss -- you need some mechanism to tell the auto-encoder whether it did the right thing or not.

As for the VAE, it uses a probabilistic approach, so we have to learn the mean and covariance of a Gaussian.

A VAE is a third type of auto-encoder. It's a bit special because it is well-grounded mathematically; no ad-hoc metrics needed. The math is too complicated to go through here, but the key ideas are that:

  • We want the latent space to be continuous. Rather than assigning each class to its own corner of the latent space, we want the latent space to have a well-defined, continuous shape (i.e., a Gaussian). This is nice because it forces the latent space to be semantically meaningful.
  • The mapping between pictures and the latent space should be probabilistic rather than deterministic. This is because the same subject can produce multiple images. (A short sketch of the resulting training objective follows this list.)
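For the curious, the standard training objective in the all-Gaussian case is short enough to sketch. The names mu and log_var are assumptions here, standing for the encoder's learned mean and log-variance; the KL term is what pulls the latent distribution toward a standard Gaussian and keeps the latent space continuous:

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_hat, mu, log_var):
        # Reconstruction term: how well did we rebuild the input?
        recon = F.mse_loss(x_hat, x, reduction="sum")
        # Closed-form KL divergence from N(mu, sigma^2) to N(0, 1),
        # summed over latent dimensions.
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl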

So, the workflow is this:

  • You start with your image as before
  • As before, your encoder determines a vector (say, length 200).
  • But that vector is not your latent representation. Instead, you use that vector as the parameters that define a distribution over the latent space. For example, maybe you choose your latent space to be a 100-dimensional Gaussian. A 100-dimensional Gaussian will require a mean and a standard deviation in each dimension -- this is what you use your length-200 vector for.
  • Now you have a probability distribution. You sample one point from this distribution. This is your image's representation in the latent space.
  • As before, your decoder will turn this vector into a new "output" (say, a vector of length 200K).
  • But, this "output" is not your output image. Instead, you use these 200K parameters to define a 100K-dimensional Gaussian. Then you sample one point from this distribution -- that's your output image.

Of course, there's nothing special about a Gaussian; you could just as easily use some other parametric distribution. In practice, people usually use Gaussians.
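Here is a sketch of the encoder half of that workflow in PyTorch, with sizes following the example above (a length-200 parameter vector defining a 100-dimensional Gaussian). The "reparameterization trick" in the last lines is the standard way to keep the sampling step differentiable, a detail the workflow glosses over:

    import torch
    import torch.nn as nn

    class VAEEncoder(nn.Module):
        def __init__(self, n_pixels=784, n_latent=100):
            super().__init__()
            self.hidden = nn.Sequential(nn.Linear(n_pixels, 400), nn.ReLU())
            # Two heads: together they produce the "length-200 vector" of
            # distribution parameters (a mean and a log-variance per dimension).
            self.mu = nn.Linear(400, n_latent)
            self.log_var = nn.Linear(400, n_latent)

        def forward(self, x):
            h = self.hidden(x)
            mu, log_var = self.mu(h), self.log_var(h)
            # Sample z = mu + sigma * eps so that gradients can flow
            # through mu and log_var despite the random sampling.
            eps = torch.randn_like(mu)
            z = mu + torch.exp(0.5 * log_var) * eps
            return z, mu, log_var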

This sometimes gives better results than other auto-encoders. Further, you sometimes get interesting results when you look between the classes in your latent space. An image's distance in the latent space from the cluster center is sometimes related to uncertainty.

Moreover, there is the nice property that these high-dimensional Gaussians are probability distributions in a rigorous mathematical sense. They approximate the probability that a given image belongs to a given class. So, there is some thought that VAEs will be able to overcome the "hand waving" of deep learning and put everything back on a firm Bayesian probabilistic grounding. But of course, it is only an approximation, and the approximation involves a lot of deep neural networks, so there is still plenty of hand waving at the moment.

By the way, I like to use this question during interviews -- an astonishing number of people claim to have experience with VAEs but in fact do not realize that VAEs are different from "regular" AEs.

Correct answer by cag51 on August 23, 2020

An easy way to think about autoencoders is: how well a prticlar pice of infrmaton can be reconstrcted frm its reducd or otherwse comprssed reprsentaton. If you made it this far it means that you successfully reconstructed the previous sentence by using only 92 of its original 103 characters.

More specifically, autoencoders are neural networks that are trained to learn efficient data codings in an unsupervised manner. The aim is to learn a representation of a given dataset by training the network to ignore "not important" signals like noise. Typically, AEs are considered for dimensionality reduction.

Practically, an AE

  • initially compresses the input data into a latent-space representation
  • reconstructs the output from this latent-space representation
  • calculates the difference between the input and the output, which is defined as the reconstruction loss.

In this training loop, the AE minimises this reconstruction loss so that the output is as similar to the input as possible.
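A sketch of that training loop in PyTorch, assuming a model like the Autoencoder shown earlier and a data_loader that yields (image, label) batches as in torchvision:

    import torch
    import torch.nn.functional as F

    def train(model, data_loader, epochs=10, lr=1e-3):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, _ in data_loader:         # labels are ignored: unsupervised
                x = x.view(x.size(0), -1)    # flatten images into vectors
                x_hat = model(x)             # compress, then reconstruct
                loss = F.mse_loss(x_hat, x)  # the reconstruction loss
                optimizer.zero_grad()
                loss.backward()              # minimise it...
                optimizer.step()             # ...so the output approaches the input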

Answered by hH1sG0n3 on August 23, 2020

One approach that I've found helpful when considering autoencoders is the following result: whereas methods such as PCA identify axes of maximal variation in the input space, the introduction of non-linear activation functions in the autoencoder allows for the identification of axes of maximal variation embedded in a (potentially) non-linear transform of the space.

As an example, consider data in R^2 distributed according to the function y = x^2 + epsilon, where epsilon ~ N(0, 0.0001). Here, the goal is to store inputs as one-dimensional compressions. A PCA approach could possibly introduce significant loss (as long as the support is sufficiently large), but an autoencoder with non-linearities will be able to identify the principal embedded axis in the transform space as the one with pre-image roughly at y = x^2 in the input space, and therefore will introduce much less loss. You can think of the autoencoder training regime as working to approximate a transform functor which produces a transform space with a linear pre-image at y = x^2. The autoencoder then works by storing inputs in terms of where they lie on the linear image of y = x^2.
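A toy illustration of the PCA half of that comparison, using numpy and scikit-learn (the sample size and support are arbitrary choices). PCA's single component flattens the parabola onto a straight line, which is exactly the loss a non-linear autoencoder with one latent unit can largely avoid:

    import numpy as np
    from sklearn.decomposition import PCA

    # Points on y = x^2 plus noise with variance 0.0001 (std 0.01).
    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, size=1000)
    data = np.stack([x, x**2 + rng.normal(0, 0.01, size=1000)], axis=1)

    # Keep only the single direction of maximal variance, then reconstruct:
    # the curvature of the parabola is lost in the projection.
    pca = PCA(n_components=1)
    reconstructed = pca.inverse_transform(pca.fit_transform(data))
    print("PCA reconstruction MSE:", np.mean((data - reconstructed) ** 2))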

Observe that absent the non-linear activation functions, an autoencoder essentially becomes equivalent to PCA -- up to a change in basis. A useful exercise might be to consider why this is.

Answered by Josh Purtell on August 23, 2020
