Data Science Asked by Math J on February 26, 2021
From Jon Shlens’s A Tutorial on Principal Component Analysis – version 1, page 7, section 4.5, II:
The formalism of sufficient statistics captures the notion that the
mean and the variance entirely describe a probability distribution.
The only zero-mean probability distribution that is fully described by
the variance is the Gaussian distribution. In order for this
assumption to hold, the probability distribution of $x_i$ must be
Gaussian.
($x_i$ denotes a random variable: the value of the $i^\text{th}$ original feature. That is, the quote seems to claim that for the assumption to hold, each of the original features must be normally distributed.)
Why the Gaussian assumption? And why might PCA fail if the data are not Gaussian distributed?
Edit: to give more info, on page 12 of the tutorial the author gives an example of non-Gaussian distributed data that causes PCA to fail.
Someone correct me if I'm wrong, but the PCA process itself doesn't assume anything about the distribution of your data. The PCA algorithm is simple: center the data, compute the covariance matrix of the features, and find its eigenvectors and eigenvalues.
The result will be an ordered list of orthogonal vectors (eigenvectors), and scales (eigenvalues). This set of vectors/values can be viewed as a summary of your data, particularly if all you care about is your data's variance.
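That procedure can be sketched in a few lines of numpy. This is a minimal illustration, not code from the post; the `pca` helper name is my own:

```python
import numpy as np

def pca(X):
    """Minimal PCA sketch: center the data, eigendecompose the
    covariance matrix, and sort components by explained variance."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
eigvals, eigvecs = pca(X)
# The eigenvectors form an orthonormal basis: V^T V = I.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(3), atol=1e-8)
```

Note that nothing in these steps looks at the distribution of the data; only its second-order statistics (the covariance matrix) enter the computation.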
I think there is an implicit assumption that the orthogonality (uncorrelatedness) of the resulting components implies their independence, and from what I understand that's true if the data is Gaussian but not necessarily true in general. So I suppose whether your data can be modeled as Gaussian may or may not matter, depending on your use case.
Answered by tom on February 26, 2021
Most of the sources I have found (e.g. Wikipedia) don't list a Gaussian distribution as a requirement of PCA.
Moreover, it seems that Shlens himself doesn't believe that anymore:
I found 2 more versions of Shlens' tutorial: version 2 and version 3.02. The latter seems to be the current version (as Shlens' web page links to it), so I will refer only to version 3.02 in my answer.
In version 3.02, the paragraph you quoted was removed from the "Summary of Assumptions" section, so the Gaussian assumption no longer appears among the assumptions the section lists.
On page 10, Shlens gives an example of when one might see the result of PCA as a failure, and then explains why PCA didn't really fail:
The solution to this paradox lies in the goal we selected for the analysis. The goal of the analysis is to decorrelate the data, or said in other terms, the goal is to remove second-order dependencies in the data. In the data sets of Figure 6, higher order dependencies exist between the variables. Therefore, removing second-order dependencies is insufficient at revealing all structure in the data.
That is, the PCs are guaranteed to be uncorrelated, so uncorrelatedness is exactly what we should expect of them. However, if we expected the PCs to be independent, then we would consider PCA to have failed whenever the PCs aren't independent (e.g. in the examples in figure 6).
(See this answer for another explanation of this paragraph.)
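The distinction between "uncorrelated" and "independent" can be demonstrated with a toy non-Gaussian dataset. Below is a sketch (my own construction, not from the tutorial) where one feature is a deterministic function of the other, yet the principal components come out uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100_000)
y = x**2                      # purely higher-order (nonlinear) dependence on x
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)

# Project the centered data onto its principal components via SVD.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt.T

# Second-order dependence is gone: the PCs are uncorrelated...
corr = np.corrcoef(pcs, rowvar=False)
assert abs(corr[0, 1]) < 0.05

# ...but far from independent: the square of the first PC is almost
# perfectly correlated with the second PC.
corr_sq = abs(np.corrcoef(pcs[:, 0]**2, pcs[:, 1])[0, 1])
assert corr_sq > 0.9
```

PCA removed the second-order dependency, as it is designed to do, but the higher-order dependency (here, $y = x^2$) survives untouched, which is exactly the point of the quoted paragraph.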
If we assume that the original dataset is Gaussian distributed (i.e. the features are jointly normally distributed), then by definition every linear combination of the original features is normally distributed.
Each of the PCs given by PCA is a linear combination of the original features. Thus, also every linear combination of the PCs is a linear combination of the original features, and so every linear combination of the PCs is normally distributed.
So, by definition the PCs are jointly normally distributed. PCA guarantees that the PCs are uncorrelated, and for jointly normally distributed variables, uncorrelatedness implies independence, so the PCs are also independent.
(Note that in the original paragraph quoted in the question, Shlens seems to claim that each of the original features should be normally distributed. However, I believe that was a mistake, and he actually meant that the original features should be jointly normally distributed (I deduced that mainly from footnote 7 on page 10 in version 3.02). This answer explains why these conditions are not equivalent in the 2D case; similarly, they aren't equivalent for any dimension $>1$.)
Thus, under the assumption that the original dataset is Gaussian distributed, PCA guarantees that the PCs are independent.
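This can be checked empirically. The following sketch (my own, under the assumptions above) draws jointly Gaussian, correlated data, runs PCA, and verifies that the PCs look independent, i.e. even nonlinear transforms of one PC are uncorrelated with the other:

```python
import numpy as np

rng = np.random.default_rng(0)
# Jointly Gaussian data with correlated features.
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)
Xc = X - X.mean(axis=0)

# Project the centered data onto its principal components via SVD.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt.T

# The PCs are uncorrelated by construction...
assert abs(np.corrcoef(pcs, rowvar=False)[0, 1]) < 1e-6

# ...and, the data being jointly Gaussian, they should also be independent.
# A rough empirical check: nonlinear transforms stay uncorrelated too.
assert abs(np.corrcoef(pcs[:, 0]**2, pcs[:, 1])[0, 1]) < 0.02
assert abs(np.corrcoef(pcs[:, 0], pcs[:, 1]**2)[0, 1]) < 0.02
```

Contrast this with the $y = x^2$ example above, where the same nonlinear-transform check fails badly.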
Answered by Oren Milman on February 26, 2021