Data Science Asked by Math J on February 26, 2021
From Jon Shlens’s A Tutorial on Principal Component Analysis – version 1, page 7, section 4.5, II:
The formalism of sufficient statistics captures the notion that the
mean and the variance entirely describe a probability distribution.
The only zero-mean probability distribution that is fully described by
the variance is the Gaussian distribution. In order for this
assumption to hold, the probability distribution of $x_i$ must be
Gaussian.
($x_i$ denotes a random variable: the value of the $i^\text{th}$ original feature. That is, the quote seems to claim that for the assumption to hold, each of the original features must be normally distributed.)
Why the Gaussian assumption? And why might PCA fail if the data are not Gaussian distributed?
Edit: to give more info, on page 12 of the tutorial the author gives an example of non-Gaussian distributed data that causes PCA to fail.
Someone correct me if I'm wrong, but the PCA process itself doesn't assume anything about the distribution of your data. The PCA algorithm is simple: center the data, compute the covariance matrix of the features, and find its eigenvectors and eigenvalues.
The result will be an ordered list of orthogonal vectors (eigenvectors), and scales (eigenvalues). This set of vectors/values can be viewed as a summary of your data, particularly if all you care about is your data's variance.
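That procedure can be sketched in a few lines of numpy. This is a minimal illustration, not code from the post; the `pca` helper name is my own:

```python
import numpy as np

def pca(X):
    """Minimal PCA sketch: center the data, eigendecompose the
    covariance matrix, and sort components by explained variance."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
eigvals, eigvecs = pca(X)
# The eigenvectors form an orthonormal basis: V^T V = I.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(3), atol=1e-8)
```

Note that nothing in these steps looks at the distribution of the data; only its second-order statistics (the covariance matrix) enter the computation.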
I think there is an implicit assumption that the orthogonality (uncorrelatedness) of the resulting components implies their independence, and from what I understand that's true if the data is Gaussian but not necessarily true in general. So I suppose whether your data can be modeled as Gaussian may or may not matter, depending on your use case.
Answered by tom on February 26, 2021
Most of the sources I have found (e.g. Wikipedia) don't list a Gaussian distribution as a requirement of PCA.
Moreover, it seems that Shlens himself doesn't believe that anymore:
I found 2 more versions of Shlens' tutorial: version 2 and version 3.02. The latter seems to be the current version (as Shlens' web page links to it), so I will refer only to version 3.02 in my answer.
In version 3.02, the paragraph you quoted was removed from the "Summary of Assumptions" section, so the Gaussian assumption no longer appears among the assumptions the section lists.
On page 10, Shlens gives an example of when one might see the result of PCA as a failure, and then explains why PCA didn't really fail:
The solution to this paradox lies in the goal we selected for the analysis. The goal of the analysis is to decorrelate the data, or said in other terms, the goal is to remove second-order dependencies in the data. In the data sets of Figure 6, higher order dependencies exist between the variables. Therefore, removing second-order dependencies is insufficient at revealing all structure in the data.
That is, the PCs are guaranteed to be uncorrelated, so uncorrelatedness is exactly what we should expect of them. However, if we expected the PCs to be independent, then we would consider PCA to have failed whenever the PCs aren't independent (e.g. in the examples in figure 6).
(See this answer for another explanation of this paragraph.)
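The distinction between "uncorrelated" and "independent" can be demonstrated with a toy non-Gaussian dataset. Below is a sketch (my own construction, not from the tutorial) where one feature is a deterministic function of the other, yet the principal components come out uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)
y = x**2                      # purely higher-order (nonlinear) dependence on x
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)

# Project the centered data onto its principal components via SVD.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt.T

# Second-order dependence is gone: the PCs are uncorrelated...
corr = np.corrcoef(pcs, rowvar=False)
assert abs(corr[0, 1]) < 0.05

# ...but far from independent: the square of the first PC is almost
# perfectly correlated with the second PC.
corr_sq = abs(np.corrcoef(pcs[:, 0]**2, pcs[:, 1])[0, 1])
assert corr_sq > 0.9
```

PCA removed the second-order dependency, as it is designed to do, but the higher-order dependency (here, $y = x^2$) survives untouched, which is exactly the point of the quoted paragraph.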
If we assume that the original dataset is Gaussian distributed (i.e. the features are jointly normally distributed), then by definition every linear combination of the original features is normally distributed.
Each of the PCs given by PCA is a linear combination of the original features. Thus, also every linear combination of the PCs is a linear combination of the original features, and so every linear combination of the PCs is normally distributed.
So, by definition the PCs are jointly normally distributed. PCA guarantees that the PCs are uncorrelated, and for jointly normally distributed variables, uncorrelatedness implies independence, so the PCs are also independent.
(Note that in the original paragraph quoted in the question, Shlens seems to claim that each of the original features should be normally distributed. However, I believe that was a mistake, and he actually meant that the original features should be jointly normally distributed (I deduced that mainly from footnote 7 on page 10 in version 3.02). This answer explains why these conditions are not equivalent in the 2D case; similarly, they aren't equivalent for any dimension $>1$.)
Thus, under the assumption that the original dataset is Gaussian distributed, PCA guarantees that the PCs are independent.
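This can be checked empirically. The following sketch (my own, under the assumptions above) draws jointly Gaussian, correlated data, runs PCA, and verifies that the PCs look independent, i.e. even nonlinear transforms of one PC are uncorrelated with the other:

```python
import numpy as np

rng = np.random.default_rng(0)
# Jointly Gaussian data with correlated features.
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
Xc = X - X.mean(axis=0)

# Project the centered data onto its principal components via SVD.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt.T

# The PCs are uncorrelated by construction...
assert abs(np.corrcoef(pcs, rowvar=False)[0, 1]) < 1e-6

# ...and, the data being jointly Gaussian, they should also be independent.
# A rough empirical check: nonlinear transforms stay uncorrelated too.
assert abs(np.corrcoef(pcs[:, 0]**2, pcs[:, 1])[0, 1]) < 0.02
assert abs(np.corrcoef(pcs[:, 0], pcs[:, 1]**2)[0, 1]) < 0.02
```

Contrast this with the $y = x^2$ example above, where the same nonlinear-transform check fails badly.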
Answered by Oren Milman on February 26, 2021