
Recall the log-posterior for \(X=x\) in Class \(k\) as \[ \begin{aligned} \log q_k(x) & =-\log\left[ \left( 2\pi\right) ^{p/2}\left\vert \Sigma_{k}\right\vert ^{1/2}\right] +\log\pi_{k}-\log f\left( x\right) \\ & -\frac{1}{2}x^{T}\Sigma_{k}^{-1}x+x^{T}\Sigma_{k}^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{T}\Sigma_{k}^{-1}\mu_{k} \end{aligned} \]
If all \(\Sigma_{k}=\Sigma\), the discriminant function \(\delta_k(x)\) is a linear function of \(x\), the “feature variables”, and is thus called a linear discriminant function
Linear discriminant analysis (LDA) uses an estimate \(\hat{\delta}_k(x)\) of \(\delta_k(x)\) to approximate the Bayes classifier when all \(\Sigma_{k}=\Sigma\)
For \(k,l \in \mathcal{G}\) \[ \begin{aligned} D_{k,l}(x) & = \log \left(\frac{q_k(x)}{q_l(x)} \right) = \delta_k(x)-\delta_l(x)\\ & = \log\frac{\pi_{k}}{\pi_{l}}-\frac{1}{2}\left( \mu_{k}+\mu_{l}\right)^T\Sigma^{-1}\left(\mu_{k}-\mu_{l}\right)\\ & \quad +x^{T}\Sigma^{-1}\left( \mu_{k}-\mu_{l}\right) \end{aligned} \]
So, \(\delta_k(x)=\delta_l(x)\) iff (if and only if) \(D_{k,l}(x)=0\), i.e., the decision boundary between Class \(k\) and Class \(l\) is the solution set \[ B_{k,l}=\{x \in \mathbb{R}^p: D_{k,l}(x)=0\}, \] which is a hyperplane in \(\mathbb{R}^p\)
Explanation of why \(B_{k,l}\) is a hyperplane: write \(D_{k,l}(x)=b_{k,l}+\mathbf{c}_{k,l}^{T}x\), where \[ b_{k,l}=\log\frac{\pi_{k}}{\pi_{l}}-\frac{1}{2}\left( \mu_{k}+\mu_{l}\right)^T\Sigma^{-1}\left(\mu_{k}-\mu_{l}\right), \quad \mathbf{c}_{k,l}=\Sigma^{-1}\left( \mu_{k}-\mu_{l}\right) \]
So, \(D_{k,l}\) is an affine function of \(x\), and the solution set of \(D_{k,l}(x)=0\) is by definition a hyperplane in \(\mathbb{R}^p\)
When \(p=1\), the solution is \(x=-b_{k,l}/\mathbf{c}_{k,l}\) if \(\mathbf{c}_{k,l} \ne 0\)
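As a quick numeric check (a minimal NumPy sketch; the parameters \(\pi_k,\mu_k,\Sigma\) below are made-up illustrations), \(D_{k,l}\) is affine in \(x\), which is exactly why its zero set is a hyperplane:

```python
import numpy as np

# Made-up parameters for two classes k and l (illustrative assumptions)
pi_k, pi_l = 0.3, 0.7
mu_k = np.array([1.0, 2.0])
mu_l = np.array([-1.0, 0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
Sinv = np.linalg.inv(Sigma)

# D_{k,l}(x) exactly as in the display above
def D(x):
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k + mu_l) @ Sinv @ (mu_k - mu_l)
            + x @ Sinv @ (mu_k - mu_l))

# Affine form b_{k,l} + c_{k,l}^T x
c = Sinv @ (mu_k - mu_l)                        # normal vector of the hyperplane
b = np.log(pi_k / pi_l) - 0.5 * (mu_k + mu_l) @ Sinv @ (mu_k - mu_l)

x = np.array([0.7, -1.2])
assert np.isclose(D(x), b + c @ x)              # D is affine, so {D = 0} is a hyperplane
```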
In practice, none of the \(\pi_{k}\)’s, \(\mu_{k}\)’s and \(\Sigma\) is known, and they need to be estimated. In the following, the hat \(\hat{}\) over a symbol denotes an estimate of the symbol.
For LDA, the \(K\) component Gaussian densities have mean vectors \(\mu_k\) and a common covariance matrix \(\Sigma\).
Regardless of the relative magnitudes of \(p\) and \(n\), we need to estimate \(K\), the number of classes, since \(K\) is unknown.
\(K\) can be chosen by the Akaike information criterion (AIC) or Bayesian information criterion (BIC), which balances the predictive performance of the Gaussian mixture model against its complexity
Here “model complexity” takes into account \(K\) (roughly in the sense that the more parameters a model has, the more complicated it is)
Consider the \(2\)-group Gaussian mixture with \[ \pi_1=\pi_2=0.5, \quad \mu_1 \ne \mu_2, \quad \sigma_1^2=\sigma_2^2=1 \] Then \[\delta_k(x) = x \mu_k - \mu_k^2/2 + \log\pi_{k}\]
Test error: Bayes classifier (10.6%), LDA (11.1%)
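A minimal NumPy sketch of this 1-d LDA; the specific means \(\mp 1.25\) are assumptions for illustration (the text only requires \(\mu_1 \ne \mu_2\)), chosen so that the Bayes error \(\Phi(-1.25)\approx 10.6\%\) is consistent with the figure above:

```python
import numpy as np

# Sketch of 1-d LDA for the two-group mixture above; the means -1.25 and 1.25
# are illustrative assumptions (the text only requires mu_1 != mu_2)
rng = np.random.default_rng(0)
n = 500                                         # training observations per class
x1 = rng.normal(-1.25, 1.0, n)                  # class 1 training sample
x2 = rng.normal(1.25, 1.0, n)                   # class 2 training sample

pi_hat = (0.5, 0.5)                             # known equal priors
mu_hat = (x1.mean(), x2.mean())                 # plug-in mean estimates

def delta(x, mu_k, pi_k):                       # delta_k(x) = x mu_k - mu_k^2/2 + log pi_k
    return x * mu_k - mu_k ** 2 / 2 + np.log(pi_k)

xt = rng.normal(1.25, 1.0, 10000)               # test points truly from class 2
acc = (delta(xt, mu_hat[1], pi_hat[1]) > delta(xt, mu_hat[0], pi_hat[0])).mean()
print("class-2 accuracy:", acc)                 # roughly Phi(1.25), about 0.89
```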

When \(p=2\) and \(\Sigma_k=\Sigma\),
If further \(\pi_k = \pi_0\) for all \(k\), the pairwise decision boundary for Classes \(k\) and \(l\) is the solution set of
\[ \begin{aligned} \frac{1}{2}\left( \mu_{k}+\mu_{l}\right)^T\Sigma^{-1}\left(\mu_{k}-\mu_{l}\right) =x^{T}\Sigma^{-1}\left( \mu_{k}-\mu_{l}\right) \end{aligned} \]
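A quick numeric check (toy \(\mu_k,\mu_l,\Sigma\), assumed for illustration) that the midpoint \((\mu_k+\mu_l)/2\) always satisfies this equation, so the equal-prior boundary passes through it:

```python
import numpy as np

# Toy parameters (illustrative assumptions)
mu_k = np.array([1.0, 0.0])
mu_l = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
Sinv = np.linalg.inv(Sigma)

lhs = 0.5 * (mu_k + mu_l) @ Sinv @ (mu_k - mu_l)   # left side of the equation
x_mid = 0.5 * (mu_k + mu_l)                        # midpoint of the two means
rhs = x_mid @ Sinv @ (mu_k - mu_l)                 # right side evaluated at the midpoint
assert np.isclose(lhs, rhs)                        # midpoint lies on the boundary
```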
Test error: Bayes classifier (0.0746), LDA (0.0770)

Example from Supplementary Text:

Default data set (in R library ISLR) on credit card users:
- Response: default of a user on credit card payments, where default takes value Yes or No
- Features: income, monthly credit card balance, and user being student (with value Yes) or not (with value No)
- Some observations have default=Yes, and the rest default=No
- Target: apply LDA to classify a user’s status of default using some of the features
Note: not all 10,000 observations are shown in the left panel
LDA with features balance and student:

A verification of Table 4.4, obtained by applying LDA with features balance and student to the training set:
| LDA estimated default status | True: No | True: Yes |
|---|---|---|
| No | 9644 | 252 |
| Yes | 23 | 81 |
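The error rates implied by the confusion table can be recovered directly (plain Python, counts taken from the table above):

```python
# Rows: LDA estimated status; columns: true status (counts from the table above)
no_no, no_yes = 9644, 252        # estimated No:  true No, true Yes
yes_no, yes_yes = 23, 81         # estimated Yes: true No, true Yes
n = no_no + no_yes + yes_no + yes_yes           # 10000 training observations

overall_err = (no_yes + yes_no) / n             # misclassified / total
defaulter_err = no_yes / (no_yes + yes_yes)     # missed defaulters / true defaulters
print(overall_err)                              # 0.0275
print(round(defaulter_err, 3))                  # 0.757
```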
Strictly, the Gaussian assumption of LDA does not hold for student as a feature variable, unless the two values student=Yes and student=No are taken numerically and approximated by two Gaussian distributions with very small variances (close to \(0\)), respectively
A trivial classifier that always assigns default=No to a user will achieve an error rate that is only a bit higher than the LDA training error
Summary: on the training set, LDA has a low overall error rate but a very high error rate on the defaulter class
On the training set, LDA has an overall error rate of 2.75% but an error rate of 75.7% on the defaulter class
Note: sensitivity and specificity are measures of correct classification
Some reasons for LDA’s poor performance on training set:
balance and student status probably are not features that can distinguish defaulters well from non-defaulters
Classwise density of standardized balance:

student status: “No”\(\mapsto 0\) and “Yes”\(\mapsto 1\); values \(0\) and \(1\) are used as Gaussian means in component densities

Bayes classifier uses the 0.5-threshold rule, i.e., it assigns Yes to a user’s default status if \[
q_1(x)=\Pr(\text{default}=\text{Yes}|X=x)>0.5,
\] where \(X\) is the feature vector
LDA uses an estimate \(\hat{q}_1\) of \(q_1\), assigns Yes to a user’s default status if \(\hat{q}_1(x)>0.5\), and approximates Bayes classifier
The 0-1 loss assigns equal weights to both types of misclassification, and the Bayes classifier minimizes the expected 0-1 loss
Equal weights on both types of misclassification:

When classifying a non-defaulter as a defaulter is less problematic than classifying a defaulter as a non-defaulter, assigning equal weights to both types of misclassification, as the 0-1 loss does, may be inappropriate
For example, a user’s default status is set as Yes if \[
q_1(x)=\Pr(\text{default}=\text{Yes}|X=x)>0.2
\]
However, using a threshold different from 0.5 is equivalent to not using the Bayes classifier under the 0-1 loss
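A sketch of the modified rule (the posterior values below are simulated stand-ins, not the real Default data): lowering the threshold from 0.5 to 0.2 can only add Yes labels, never remove them:

```python
import numpy as np

rng = np.random.default_rng(1)
q1 = rng.beta(0.3, 6.0, size=10)                # toy estimated posteriors, mostly small

pred_05 = q1 > 0.5                              # 0.5-threshold (Bayes-style) rule
pred_02 = q1 > 0.2                              # modified 0.2-threshold rule
assert (pred_02 | ~pred_05).all()               # every 0.5-Yes is also a 0.2-Yes
assert pred_02.sum() >= pred_05.sum()           # at least as many users flagged
```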

LDA with features balance and student, applied to training set but with modified decision rule via \[\hat{q}_1(x)=\widehat{\Pr}(\text{default}=\text{Yes}|X=x)>0.2\]
| mLDA estimated default status | True: No | True: Yes |
|---|---|---|
| No | 9432 | 138 |
| Yes | 235 | 195 |
LDA with posterior threshold 0.5: overall error rate 2.75%; error rate on defaulters 75.7%
| LDA estimated default status | True: No | True: Yes |
|---|---|---|
| No | 9644 | 252 |
| Yes | 23 | 81 |
LDA with posterior threshold 0.2: overall error rate 3.73%; error rate on defaulters 41.4%
| mLDA estimated default status | True: No | True: Yes |
|---|---|---|
| No | 9432 | 138 |
| Yes | 235 | 195 |

Threshold value 0.5 minimizes the overall error rate, since the Bayes classifier uses this value and has the lowest overall error rate (when the model is correctly specified, which for LDA means the Gaussian mixture model). However, this value gives a high error rate on the defaulter class
As the threshold is reduced, the error rate on the defaulter class decreases but the error rate on the non-defaulter class increases
Domain knowledge is needed to determine which threshold value to use
The AUC is 0.95 for this data set:

In the example: “default=Yes” \(\mapsto +\); “default=No” \(\mapsto -\)
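A minimal computation of the AUC from scores in its pairwise form, i.e., the probability that a random “+” outscores a random “−” (the score values below are toy assumptions, not the Default data):

```python
import numpy as np

# Toy posterior scores; "+" = default=Yes, "-" = default=No, as above
scores_pos = np.array([0.9, 0.6, 0.65, 0.3])    # scores of true defaulters
scores_neg = np.array([0.1, 0.4, 0.35, 0.05])   # scores of true non-defaulters

# AUC = P(score_+ > score_-), with ties counting 1/2
diff = scores_pos[:, None] - scores_neg[None, :]
auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()
print(auc)                                      # 0.875 for these toy scores
```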

Recall the log-posterior for \(X=x\) in Class \(k\) as \[ \begin{aligned} \log q_k(x) & =-\log\left[ \left( 2\pi\right) ^{p/2}\left\vert \Sigma_{k}\right\vert ^{1/2}\right] +\log\pi_{k}-\log f\left( x\right) \\ & -\frac{1}{2}x^{T}\Sigma_{k}^{-1}x+x^{T}\Sigma_{k}^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{T}\Sigma_{k}^{-1}\mu_{k} \end{aligned} \]
When \(\Sigma_{k}\)’s are different, for \(k,l \in \mathcal{G}\) \[ \begin{aligned} D_{k,l}(x) & = \log \left[{q_k(x) }/{q_l(x) } \right] = \delta_k(x)-\delta_l(x)\\ & =\log\frac{\pi_{k}}{\pi_{l}}- \frac{1}{2}\log \frac{\vert \Sigma_k\vert}{\vert \Sigma_l\vert} -\frac{1}{2}\left(x-\mu_{k}\right)^T\Sigma_k^{-1}\left(x-\mu_{k}\right) \\ & \quad +\frac{1}{2}\left(x-\mu_{l}\right)^T\Sigma_l^{-1}\left(x-\mu_{l}\right) \end{aligned} \]
So, \(\delta_k(x)=\delta_l(x)\) iff \(D_{k,l}(x)=0\), i.e., the decision boundary between Class \(k\) and Class \(l\) is the solution set \[ B_{k,l}=\{x \in \mathbb{R}^p: D_{k,l}(x)=0\}, \] which is NOT a hyperplane in \(\mathbb{R}^p\) when \(\Sigma_{k}\)’s are different
Explanation of why \(B_{k,l}\) is not a hyperplane: \(D_{k,l}(x)\) contains the quadratic term \(-\frac{1}{2}x^{T}\left(\Sigma_k^{-1}-\Sigma_l^{-1}\right)x\), which does not vanish when \(\Sigma_{k}\ne\Sigma_{l}\), so \(B_{k,l}\) is a quadric surface rather than a hyperplane
In practice, none of the \(\pi_{k}\)’s, \(\mu_{k}\)’s and \(\Sigma_k\)’s are known, and they need to be estimated.
For QDA, the \(K\) component Gaussian densities have mean vectors \(\mu_k\) and covariance matrices \(\Sigma_k\).
When there are \(p\) features, there are \((K-1)\times \left[p(p+3)/2+1\right]\) parameters to estimate, since there are \(K-1\) differences \(\delta_k(x)-\delta_K(x)\) and each difference has respectively \(p(p+1)/2\), \(p\) and \(1\) parameters for its quadratic, linear and 0th order term
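A quick check of this parameter count; the per-difference count \(p(p+1)/2 + p + 1\) equals the factored form \(p(p+3)/2 + 1\) used above:

```python
# (K-1) * [p(p+3)/2 + 1] parameters for QDA's pairwise differences
def qda_params(K, p):
    per_diff = p * (p + 1) // 2 + p + 1         # quadratic + linear + constant terms
    assert per_diff == p * (p + 3) // 2 + 1     # same count, factored as in the text
    return (K - 1) * per_diff

print(qda_params(3, 2))                         # 2 * 6 = 12
```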
The MLEs of \(\mu_k\) and \(\Sigma_k\) usually work well when \(p\) is smaller than the sample size \(n\). But when \(p >n\), they are usually not accurate and the plug-in \(\hat{\delta}_{k}\) may not perform well in classification
Choice of LDA or QDA depends on data:

LDA on transformed feature variables:
Caution: LDA with original feature variables has linear decision boundary in original feature space. However, LDA with nonlinearly transformed feature variables has nonlinear decision boundary in original feature space.
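A sketch of this caution under assumed toy data (pure-NumPy LDA, equal priors): LDA on the quadratically expanded features \((x_1, x_2, x_1^2, x_2^2, x_1 x_2)\) separates a class arranged in a ring around the other, which no boundary linear in the original \((x_1,x_2)\) can do:

```python
import numpy as np

rng = np.random.default_rng(2)

def expand(X):                                  # nonlinear feature map
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Class 0: blob near the origin; class 1: surrounding ring (not linearly separable)
n = 200
X0 = rng.normal(0, 0.5, (n, 2))
ang = rng.uniform(0, 2 * np.pi, n)
X1 = np.column_stack([3 * np.cos(ang), 3 * np.sin(ang)]) + rng.normal(0, 0.3, (n, 2))

Z0, Z1 = expand(X0), expand(X1)
mu0, mu1 = Z0.mean(0), Z1.mean(0)
S = (np.cov(Z0.T) * (n - 1) + np.cov(Z1.T) * (n - 1)) / (2 * n - 2)   # pooled covariance
w = np.linalg.solve(S, mu1 - mu0)               # LDA direction in transformed space
b = -0.5 * (mu0 + mu1) @ w                      # intercept for equal priors

labels = np.r_[np.zeros(n), np.ones(n)]
acc = ((np.vstack([Z0, Z1]) @ w + b > 0) == labels).mean()
print("training accuracy:", acc)                # near 1: the boundary is quadratic in (x1, x2)
```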
Explanation of functional/statistical independence:
Figure 4.1 from Supplementary Text:

Figure 4.6 from Supplementary Text: LDA on transformed features may perform close to QDA on original features

When \(p > n\), other estimates of \(\Sigma_k\) (different from the MLE) are recommended, such as the regularized estimate \[\hat{\Sigma}_{k,\alpha}=\alpha \hat{\Sigma}_k + (1-\alpha) \hat{\Sigma}, \quad \alpha \in [0,1],\] which shrinks the per-class estimate \(\hat{\Sigma}_k\) toward the pooled estimate \(\hat{\Sigma}\); \(\alpha=1\) recovers QDA and \(\alpha=0\) recovers LDA
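A sketch of the regularized estimate on toy data (the pooled \(\hat\Sigma\) below is a stand-in, estimated from a separate toy sample rather than from within-class scatter):

```python
import numpy as np

rng = np.random.default_rng(3)
X_k = rng.normal(size=(30, 4))                  # class-k sample (toy)
X_all = rng.normal(size=(120, 4))               # toy stand-in for the pooled sample

Sigma_k = np.cov(X_k.T)                         # per-class covariance estimate
Sigma_pooled = np.cov(X_all.T)                  # pooled covariance estimate

def regularized(alpha):                         # alpha*Sigma_k + (1-alpha)*Sigma_pooled
    return alpha * Sigma_k + (1 - alpha) * Sigma_pooled

assert np.allclose(regularized(1.0), Sigma_k)         # alpha = 1: QDA's estimate
assert np.allclose(regularized(0.0), Sigma_pooled)    # alpha = 0: LDA's estimate
```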
DA with regularized or shrinkage estimates is called regularized discriminant analysis (RDA)
Note: contents on RDA adopted from Section 4.3.1 of Supplementary Text
Figure 4.4 from Supplementary Text:

Figure 4.7 from Supplementary Text:

Recall the Default data set (in R library ISLR) on credit card users and the target of classifying a user’s status of default using some of the features
Default data: check Gaussian assumption (student hidden in contours)

Classification results of QDA with features balance and income on training set and “0.5-threshold rule” of default=Yes if \(\widehat{\Pr}(\text{default}=\text{Yes}|X=x)>0.5\):
| QDA estimated default status | True: No | True: Yes |
|---|---|---|
| No | 9637 | 241 |
| Yes | 30 | 92 |
AUC:
[1] 0.9489247
Average of 10 ROC curves obtained by 10-fold cross-validation; color key for threshold value on posterior of default class in R

Difference 1:
Logistic regression does not employ a mixture model; it models \[q_k(x)=\Pr\left(Y=k|X=x\right)\] directly as the conditional probability of Class \(k\) given the feature value \(x\), without specifying the class-conditional densities \(f_k\)
LDA employs a Gaussian mixture model and takes \[q_k(x)=\Pr\left(Y=k|X=x\right)\] as the posterior probability of Class \(k\) after \(x\) is observed
Similarity 1: both methods employ linear modelling equations for log-odds
Recall \[ \begin{aligned} \Pr\left( Y=k|X=x\right)= \left[\Pr\left(Y=k\right)\Pr(X=x|Y=k)\right] \div {f(x)}, \end{aligned} \] where \(f\) is the marginal density of \(X\)
Difference 2:
Logistic regression does not need the marginal distribution \(F\) of \(X\), and maximizes the conditional likelihood \(\Pr\left(Y=k|X\right)\) given observations from \(X\)
LDA aims to obtain maximal posterior \(\Pr(Y=g|X)\) when it classifies \(X\) into Class \(g\) and hence implicitly uses \(F\) to maximize the joint likelihood on \((X,Y)\)
Difference 3:
If observations can be perfectly separated by a hyperplane into two classes, then for logistic regression the maximum likelihood estimates of the regression parameters are undefined (since logistic regression does not use the marginal distribution of the feature variables)
However, in this case, the LDA coefficients for the same data will be well defined and LDA can be implemented, since LDA uses the marginal distribution of feature variables and this marginal distribution will not permit such degeneracies in parameter estimation
2 features and 2 classes
100 random training data sets
Methods compared: kNN-1, kNN-CV, LDA, logistic regression, and QDA
On each training set, fit each method to the training data, and obtain the resulting test error on a large test set
Scenario 1: uncorrelated Gaussian feature variables, with 20 observations per class; LDA overall winner
Scenario 2: equally correlated Gaussian feature variables across component bivariate Gaussian densities, with 20 observations per class; LDA overall winner
Scenario 3: uncorrelated Student t feature variables with 50 observations per class, violating assumptions of LDA and QDA; logistic regression overall winner
Winner: LDA (Scenario 1 and 2); Logistic (Scenario 3)

Scenario 4: unequally correlated Gaussian feature variables across component bivariate Gaussian densities; QDA overall winner
Scenario 5: uncorrelated Gaussian feature variables within each class but class labels sampled from the logistic function using squared features; QDA overall winner
Scenario 6: as Scenario 5 but class labels sampled from a more complicated non-linear function of features; kNN-CV overall winner
Winner: QDA (Scenario 4 and 5); kNN-CV (Scenario 6)

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12