Linear Discriminant Analysis' predictions newbie question

Question

When I use predict.lda in R (MASS package), which discriminant function does the software choose? Say I have 4 classes and 3 discriminant functions: does the software always use the first discriminant function (highest trace), or does it use an ensemble of the three functions?

cdalitz · Answer

Unfortunately, the documentation of predict.lda does not shed any light on this question, but it gives a reference to "Pattern Recognition and Neural Networks" by Ripley (1996), who writes:

Fisher’s procedure cannot tell us the threshold between the two groups in classification. It seems common practice to classify by choosing the group whose mean is nearest in the space of canonical variates. Since in that space Euclidean distance is the within-group Mahalanobis distance, this corresponds to the Bayes rule if (and only if) the prior probabilities are equal.

This refers to the decision rule on the transformed variables, i.e., after projecting the data onto the $C-1$ discriminant directions, where $C$ is the number of classes. All of these directions are used together, not just the first one. In this space, predict.lda thus assigns a sample to the class whose mean is nearest (for equal prior probabilities, as noted in the quote).
Concerning your question, beware that R's lda does not yield discriminant functions, but instead a matrix (the scaling component of the fitted object, denoted $S$ here) that transforms the data into a $(C-1)$-dimensional subspace in such a way that the classes are optimally separated. The $C$ discriminant functions $g_i$ are then
\begin{eqnarray*}
g_i(\vec{x}) & = & - \|S(\vec{x} - \vec{\mu}_i)\|^2 \\
 & = & -\underbrace{\|S\vec{x}\|^2}_{\mbox{irrelevant}} + 2\langle S\vec{x}, S\vec{\mu}_i\rangle - \|S\vec{\mu}_i\|^2
\end{eqnarray*}
where $\vec{\mu}_i$ is the mean value of class $i$, and the minus sign has been added to bring the definition in line with the usual decision rule of choosing the class with the greatest discriminant function $g_i(\vec{x})$. Note that the first term $\|S\vec{x}\|^2$ is the same for all classes and can be omitted from the discriminant function, which then becomes linear in $\vec{x}$.
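To make the equivalence concrete, here is a minimal Python sketch (not R, and not the code of predict.lda) that compares the quadratic form $-\|S(\vec{x} - \vec{\mu}_i)\|^2$ with the linear form obtained by dropping the class-independent term. The matrix S, the class means, and the test points are made-up illustration values, and the helper names are hypothetical. S is written with one row per discriminant direction, which I assume corresponds to the transpose of the scaling matrix returned by lda.

```python
# Sketch: nearest-class-mean rule in the projected space and its linear form.
# All numbers are invented for illustration; they are not output of R's lda.

def matvec(S, v):
    """Multiply matrix S (a list of rows) by vector v."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]

def dot(a, b):
    """Euclidean inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def g_quadratic(S, x, mu):
    """g_i(x) = -|S(x - mu_i)|^2, the negative squared distance to the class mean."""
    d = matvec(S, [xi - mi for xi, mi in zip(x, mu)])
    return -dot(d, d)

def g_linear(S, x, mu):
    """g_i(x) with the class-independent term -|Sx|^2 dropped."""
    Sx, Smu = matvec(S, x), matvec(S, mu)
    return 2 * dot(Sx, Smu) - dot(Smu, Smu)

def classify(g, S, x, means):
    """Return the index of the class with the greatest discriminant value."""
    scores = [g(S, x, mu) for mu in means]
    return max(range(len(means)), key=lambda i: scores[i])

# Hypothetical setup: 3 features, C = 3 classes, so S has C - 1 = 2 rows.
S = [[1.0, 0.5, 0.0],
     [0.0, 1.0, -0.5]]
means = [[0.0, 0.0, 0.0], [2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]
points = [[0.1, 0.2, 0.0], [1.8, 0.3, 0.9], [0.2, 2.5, 1.2], [1.0, 1.0, 1.0]]

for x in points:
    # Both forms differ only by |Sx|^2, so they must pick the same class.
    assert classify(g_quadratic, S, x, means) == classify(g_linear, S, x, means)
    print(x, "->", classify(g_linear, S, x, means))
```

Every row of S enters each score, which is why the rule uses all discriminant directions at once rather than only the first one.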
This is only a decision rule and does not yield any posterior probabilities. To estimate these, a probabilistic model needs to be assumed. In the case of LDA, this model is a (multivariate) Gaussian distribution for each class, with all covariance matrices assumed to be identical. In the transformed LDA space, this common covariance matrix is the identity matrix, which can then be inserted into the normal density to obtain probabilities.
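Written out, and assuming that $S$ maps the common covariance matrix to the identity matrix as described above, Bayes' rule with this Gaussian model gives the following posterior probability. The symbol $\pi_i$ is introduced here for illustration and denotes the prior probability of class $i$:
\begin{eqnarray*}
P(i \,|\, \vec{x}) & = & \frac{\pi_i \exp\left(-\frac{1}{2}\|S(\vec{x} - \vec{\mu}_i)\|^2\right)}{\sum_{j=1}^{C} \pi_j \exp\left(-\frac{1}{2}\|S(\vec{x} - \vec{\mu}_j)\|^2\right)}
\end{eqnarray*}
The normalization constant of the Gaussian density is the same for all classes and cancels in the fraction. For equal priors, the class with the largest posterior is the one whose mean is nearest in the transformed space.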
Remark: if you drop the assumption of a common covariance matrix and allow for class-specific covariance matrices, you end up with "quadratic discriminant analysis" (R function qda).
