
The USArrests data set (a built-in data set in base R's datasets package) contains arrests per 100,000 residents in each of the 50 US states in 1973. The four features are Murder, Assault, Rape, and UrbanPop (percent of the population living in urban areas).
> library(ISLR); dim(USArrests)
[1] 50 4
> head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
> colSums(USArrests)
Murder Assault UrbanPop Rape
389.4 8538.0 3277.0 1061.6
Description of data:
If knowing three of the features Murder, Assault, Rape, and UrbanPop enabled us to know the remaining feature completely, then that fourth feature would be redundant and would not have to be measured.
Data matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\):
4 features, each with 50 observations; Murder, Assault, and Rape are considerably correlated, while UrbanPop has low correlation with these 3 features

50 observations, one for each US state; a sample correlation between two observations (computed across the features) is not sensible, since the features are measured on different scales

Variability of each (raw) feature

A dendrogram does not reveal the successive orthogonal directions along which the observations vary the most

For any positive semi-definite matrix \(\mathbf{A} \in \mathbb{R}^{p \times p}\) (whose eigenvalues are all non-negative) and any \(q \le \text{rank}(\mathbf{A})\), the orthonormal (i.e., mutually orthogonal, unit) eigenvectors (each determined up to \(\pm\) signs) that are associated with the \(q\) largest eigenvalues of \(\mathbf{A}\) are called “the first \(q\) eigenvectors” of \(\mathbf{A}\). The \(q\) largest eigenvalues of \(\mathbf{A}\) are called the first \(q\) eigenvalues of \(\mathbf{A}\)
Since each \((d_i,\tilde{\mathbf{v}}_i)\) is an eigenpair of \(\tilde{\mathbf{S}}=\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\) that has eigenvalues \(d_1 \ge \cdots \ge d_p>0\), then for any \(q \le p\), the orthonormal eigenvectors \(\{\tilde{\mathbf{v}}_i\}_{i=1}^q\) are the first \(q\) eigenvectors and \(d_1,\ldots,d_q\) the first \(q\) eigenvalues of \(\tilde{\mathbf{S}}\)
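As a concrete check of this connection (a base-R sketch; the variable names are mine), the loadings returned by prcomp() are exactly the orthonormal eigenvectors of \(\tilde{\mathbf{S}}=\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\), up to \(\pm\) signs, and \((n-1)\,\texttt{sdev}^2\) recovers the eigenvalues:

```r
# Sketch (base R only): check that prcomp() loadings are orthonormal
# eigenvectors of S = t(X) %*% X for standardized USArrests, up to signs.
X <- scale(USArrests, center = TRUE, scale = TRUE)
S <- t(X) %*% X
eig <- eigen(S)                 # spectral decomposition of S
pca <- prcomp(X)
# eigenvalues d_1 >= ... >= d_p equal (n - 1) * sdev^2 from prcomp
all.equal(as.numeric(eig$values), (nrow(X) - 1) * pca$sdev^2)
# loading vectors agree with eigenvectors up to +/- signs
max(abs(abs(eig$vectors) - abs(unclass(pca$rotation))))   # essentially 0
```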
> library(ISLR)
> # center data only
> xcentered=scale(USArrests,center=TRUE,scale=FALSE)
> # total variability of only centered data
> sum(xcentered^2)
[1] 355807.8
> # PCA for only centered data
> pca1=prcomp(xcentered)
> # directions or loading vectors
> pca1$rotation
PC1 PC2 PC3 PC4
Murder 0.04170432 -0.04482166 0.07989066 -0.99492173
Assault 0.99522128 -0.05876003 -0.06756974 0.03893830
UrbanPop 0.04633575 0.97685748 -0.20054629 -0.05816914
Rape 0.07515550 0.20071807 0.97408059 0.07232502
> # standardized data
> Xsd = scale(USArrests,center=TRUE,scale=TRUE)
> # total variability of standardized data
> sum(Xsd^2)
[1] 196
> # PCA for standardized data
> pcaXsd = prcomp(Xsd)
> # directions or loading vectors
> pcaXsd$rotation
PC1 PC2 PC3 PC4
Murder -0.5358995 0.4181809 -0.3412327 0.64922780
Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
Rape -0.5434321 -0.1673186 0.8177779 0.08902432
> # successive maximal lengths/variabilities of projections
> (nrow(USArrests)-1)*((pcaXsd$sdev)^2)
[1] 121.531837 48.498492 17.471596 8.498074
UrbanPop dominates the second loading vector \(\phi_2=\tilde{\mathbf{v}}_2\)

Recall \((d_i,\tilde{\mathbf{v}}_i)\) is an eigenpair for \(\tilde{\mathbf{S}}=\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\) that has eigenvalues \(d_1 \ge \cdots \ge d_p>0\)
Recall total variability of standardized data \(\tilde{\mathbf{X}}\) as \[ \sigma_{\text{total}}^2=\sum\nolimits_{i=1}^n \sum\nolimits_{j=1}^p \tilde{x}_{ij}^2 = (n-1)p \]
Caution: for a data matrix \(\mathbf{X}=(x_{ij}) \in \mathbb{R}^{n \times p}\) that is not necessarily centered or scaled, the total variability is \[ \sigma_{\text{total}}^2=\sum\nolimits_{i=1}^n \sum\nolimits_{j=1}^p {x}_{ij}^2 \]
For PCA on \(\mathbf{X}=(x_{ij}) \in \mathbb{R}^{n \times p}\),
where \((d_i,{\mathbf{v}}_i)\) is an eigenpair of \({\mathbf{S}}={\mathbf{X}}^T{\mathbf{X}}\) that has eigenvalues \(d_1 \ge \cdots \ge d_r>0\) with \(r=\text{rank}(\mathbf{X})\)
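A quick base-R sanity check of the total-variability identity (a sketch; the variable names are mine): the total variability equals the trace of \(\mathbf{S}\), i.e., the sum of its eigenvalues:

```r
# Sketch: total variability of the centered data equals the sum of the
# eigenvalues of S = t(X) %*% X (the trace of S).
X <- scale(USArrests, center = TRUE, scale = FALSE)   # centered only
total <- sum(X^2)                                     # ~ 355807.8
d <- eigen(t(X) %*% X)$values
all.equal(total, sum(d))                              # TRUE
```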
(Cumulative) Proportion of variance explained by subspaces of different dimensions
> # standardized data
> Xsd = scale(USArrests,center=TRUE,scale=TRUE)
> sum(Xsd^2) # total variability of standardized data
[1] 196
> pcaXsd = prcomp(Xsd) # PCA for standardized data
> pcaXsd$rotation # directions or loading vectors
PC1 PC2 PC3 PC4
Murder -0.5358995 0.4181809 -0.3412327 0.64922780
Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
Rape -0.5434321 -0.1673186 0.8177779 0.08902432
> # successive maximal lengths/variabilities of projections
> (nrow(USArrests)-1)*((pcaXsd$sdev)^2)
[1] 121.531837 48.498492 17.471596 8.498074
> # proportion of variance explained by each PC
> (nrow(USArrests)-1)*((pcaXsd$sdev)^2)/sum(Xsd^2)
[1] 0.62006039 0.24744129 0.08914080 0.04335752
> # cumulative proportion of variance explained
> cumsum((nrow(USArrests)-1)*((pcaXsd$sdev)^2))/sum(Xsd^2)
[1] 0.6200604 0.8675017 0.9566425 1.0000000
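The same (cumulative) PVE can be obtained more directly, since the denominator \((n-1)p\) cancels; a base-R sketch (the round() calls are just for display):

```r
# Sketch: PVE as sdev^2 / sum(sdev^2); for standardized data this equals
# (n-1)*sdev^2 / ((n-1)*p), matching the computation above.
Xsd <- scale(USArrests, center = TRUE, scale = TRUE)
pca <- prcomp(Xsd)
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve, 4)          # 0.6201 0.2474 0.0891 0.0434
round(cumsum(pve), 4)
summary(pca)           # its "Proportion of Variance" row reports the same PVE
```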
(Cumulative) Proportion of variance explained

Caution: PCA is not designed for pattern recognition but rather for providing optimal orthogonal directions of successive maximal data variability
In a biplot for PCA, there are 2 coordinate systems that share the same origin, such that
one coordinate system has \(x\)-coordinate for scores (i.e., entries of score vector) of PC \(i\) and \(y\)-coordinate for scores of another PC \(j\) and has the orientation and layout of the conventional \(x\)-\(y\) coordinate system
the other coordinate system has \(x\)-coordinate for loadings (i.e., entries of loading vector) of PC \(i\) and \(y\)-coordinate for loadings of another PC \(j\), where the \(x\)-axis tick marks are on the top and \(y\)-axis tick marks are on the right
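A minimal sketch of how such a biplot is produced in base R (scale = 0 plots raw scores and loadings in the two coordinate systems described above; the cex value is just an illustrative label-size choice):

```r
# Sketch: biplot of PC1 vs PC2 for standardized USArrests using base R.
Xsd <- scale(USArrests, center = TRUE, scale = TRUE)
pca <- prcomp(Xsd)
biplot(pca, scale = 0, cex = 0.6)   # states as points, features as arrows
```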

The first loading vector places approximately equal weight on Assault, Murder, and Rape but much less weight on UrbanPop. So, the first PC roughly corresponds to a measure of overall rates of serious crimes
The second loading vector places most of its weight on UrbanPop and much less weight on the other three features Assault, Murder, and Rape. So, the second PC roughly corresponds to the level of urbanization of a state
Overall, the crime-related variables (Murder, Assault, and Rape) are located close to each other, while the UrbanPop variable is far from these three. This indicates that the crime-related variables are correlated with each other (i.e., states with high murder rates tend to have high assault and rape rates) and that UrbanPop is less correlated with the three crime-related variables
The above finding is consistent with the correlation plot for the 4 features
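Those feature correlations can be inspected directly in base R (a sketch; the interpretation in the comments is informal):

```r
# Sketch: pairwise feature correlations for USArrests.
round(cor(USArrests), 2)
# Murder, Assault, and Rape are substantially correlated with each other,
# while UrbanPop's correlations with them are noticeably weaker.
```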
We can examine differences between the US states via the biplot of the first two PC score vectors and loading vectors:
5 clusters from hierarchical clustering, indicated by different shapes and superimposed on biplot; findings from biplot and clustering of states seem to be consistent with each other

Sample variance of each feature (where features are measured on different scales):
> apply(USArrests, 2, var)
    Murder    Assault   UrbanPop       Rape
  18.97047 6945.16571  209.51878   87.72916
If PCA is applied to unscaled data:
the first loading vector would place almost all of its weight on Assault, since Assault has the largest sample variance (which greatly dominates those of the other 3 features);
the second loading vector would place most of its weight on UrbanPop, since it has the second largest sample variance (which considerably dominates those of the remaining 2 features).
We do not want the results of PCA to depend on the magnitudes of the feature measurements, so the features should be standardized before PCA

If we know covariance matrix \(\Sigma\) of feature vector \(X\), then we can implement population version PCA using spectral decomposition of \(\Sigma\), and project \(X\) onto orthonormal eigenvectors of \(\Sigma\) to achieve maximal successive explained variabilities
If we do not know \(\Sigma\), then we apply data version PCA using spectral decomposition of \(\mathbf{S}=\mathbf{X}^T \mathbf{X}\) if \(\mathbf{X}\) is column-centered since \(\mathbf{S}\) is proportional to sample covariance matrix \(\hat{\Sigma}\) of \(X\), and project each observation \(x_i\) onto orthonormal eigenvectors of \(\mathbf{S}\)
Usually, orthonormal eigenvectors of \(\mathbf{S}=\mathbf{X}^T \mathbf{X}\) are regarded as an estimate (modulo \(\pm\) signs) of orthonormal eigenvectors of \(\Sigma\). So, data version PCA actually tries to estimate population version PCA in terms of principal components, directions and scores
When \(n \gg p\), this estimation is often very good, whereas when \(n \le p\), this estimation is often bad, and other ways (different than using spectral decomposition of sample covariance matrix) of estimating population version PCA should be used
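A small simulation sketch of this point, using only MASS (which ships with R) for mvrnorm; the choices of Sigma, the seed, and the sample size are illustrative assumptions of mine, and the \(\pm\) sign ambiguity of eigenvectors is handled by comparing \(|\cos(\text{angle})|\):

```r
# Sketch: with n >> p, the first sample PCA direction closely matches the
# first eigenvector of the population covariance Sigma.
library(MASS)
set.seed(1)
Sigma <- matrix(c(4, 1, 1, 1), 2, 2)       # population covariance, p = 2
v1 <- eigen(Sigma)$vectors[, 1]            # first population eigenvector
X <- mvrnorm(n = 5000, mu = c(0, 0), Sigma = Sigma)
v1hat <- prcomp(X)$rotation[, 1]           # first estimated direction
abs(sum(v1 * v1hat))                       # |cos(angle)|, close to 1
```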
Human cancer data:
We will pick observations for the 3 cancer types “BREAST”, “OVARIAN”, and “LEUKEMIA”, which gives a data matrix \(\mathbf{X} \in \mathbb{R}^{19 \times 6830}\)
Recall data matrix \(\mathbf{X} \in \mathbb{R}^{19 \times 6830}\), i.e., \(p=6830\) features and \(n=19\) samples/observations:
We are dealing with so-called “high-dimensional data”, where dimension reduction techniques such as PCA are even more important
Each loading vector has a lot of nonzero entries; there seem to be no dominant features (i.e., genes) in any direction

Cumulative “PVE”:
[1] 16.95366 27.33261 36.83971 44.15321 50.83002
[6] 56.55363 62.01891 66.77877 71.51763 75.68344
[11] 79.67929 83.60823 87.38751 90.98552 94.09808
[16] 96.83802 99.13764 100.00000 100.00000
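A toy sketch of why the cumulative PVE reaches 100% at PC 18 here: a column-centered matrix with \(n<p\) has rank at most \(n-1\), so at most \(n-1\) PCs carry positive variance (the matrix below is synthetic, not the cancer data):

```r
# Sketch: a centered n x p matrix with n < p has rank <= n - 1, hence at
# most n - 1 principal components with positive variance.
set.seed(1)
X <- matrix(rnorm(19 * 100), nrow = 19)    # toy wide matrix: n = 19, p = 100
pca <- prcomp(X, center = TRUE)
sum(pca$sdev^2 > 1e-10)                    # 18 PCs with positive variance
```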

PC1 PC2
BREAST -60.098790 1.588815
BREAST -31.889651 10.877669
BREAST -29.035831 6.582595
OVARIAN -19.721568 27.641286
OVARIAN -12.608223 14.464804
OVARIAN -26.962732 12.273994
OVARIAN -6.453167 9.465530
OVARIAN -22.091467 32.942808
OVARIAN -16.828185 19.572828
LEUKEMIA 17.967827 -20.708967
LEUKEMIA 40.596818 -8.642872
LEUKEMIA 48.904273 -5.795735
LEUKEMIA 66.947137 15.148187
LEUKEMIA 56.697268 15.532885
LEUKEMIA 23.671867 -8.962769
BREAST 9.829156 4.542899
BREAST -6.791608 5.910896
BREAST -17.605215 -66.753270
BREAST -14.527909 -65.681584
Large positive PC1 scores are indicative of “LEUKEMIA”, and large positive PC2 scores of “OVARIAN”

Recall standardized data matrix \(\tilde{\mathbf{X}} \in \mathbb{R}^{n \times p}\)
Recall spectral decomposition of \(\tilde{\mathbf{S}}=\tilde{\mathbf{X}}^T \tilde{\mathbf{X}}\) as \[ \tilde{\mathbf{S}}\tilde{\mathbf{v}}_i = d_i \tilde{\mathbf{v}}_i \iff \tilde{\mathbf{S}} = \mathbf{V} \mathbf{D} \mathbf{V}^T \] for an orthogonal matrix \(\mathbf{V}=(\tilde{\mathbf{v}}_1,\ldots,\tilde{\mathbf{v}}_p) \in \mathbb{R}^{p\times p}\) and a diagonal matrix \(\mathbf{D}=\text{diag}\{d_1,\ldots,d_p\}\) such that \[ d_1 \ge \cdots \ge d_{r} > 0 =d_{r+1} = \cdots = d_{p} = 0 \] when \(\tilde{\mathbf{X}}\) has rank \(r\)
PCA sets \(\tilde{\mathbf{z}}_i =\tilde{\mathbf{X}}\tilde{\mathbf{v}}_i\) for \(1 \le i \le r\), where \(\tilde{\mathbf{z}}_i\) and \(\tilde{\mathbf{v}}_i\) are the score vector and loading vector for the \(i\)-th PC, respectively
Let \(\tilde{\mathbf{v}}_i=(\phi_{i1},\ldots,\phi_{ip})^T\) for a fixed \(i\). Recall the following:
If the \(s\)-th entry \(\phi_{is}\) of \(\tilde{\mathbf{v}}_i\) has absolute value that dominates those of other entries of \(\tilde{\mathbf{v}}_i\), then the projection of observation \(\tilde{x}_r\) onto \(\tilde{\mathbf{v}}_i\) satisfies \(\tilde{z}_{ri} \approx \tilde{x}_{rs}\phi_{is}\)
In this case, feature \(s\) contributes the dominant proportion of variation of standardized data \(\tilde{\mathbf{X}}\) along direction \(\tilde{\mathbf{v}}_i\)
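A numeric sketch of this approximation on the standardized USArrests data, where the UrbanPop entry of the second loading vector dominates in magnitude (variable names are mine):

```r
# Sketch: PC2 scores are well approximated by the UrbanPop column times
# its single (dominant) loading in the second loading vector.
Xsd <- scale(USArrests, center = TRUE, scale = TRUE)
pca <- prcomp(Xsd)
phi2 <- pca$rotation[, 2]                      # UrbanPop entry dominates
z2 <- pca$x[, 2]                               # PC2 score vector
approx2 <- Xsd[, "UrbanPop"] * phi2["UrbanPop"]
cor(z2, approx2)                               # high (~0.87)
```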
If a majority of entries of a loading vector \(\tilde{\mathbf{v}}_i\) are zero, then we can easily identify dominant features along direction \(\tilde{\mathbf{v}}_i\), i.e., identify features that explain or account for most (or even all) of the variability of data \(\tilde{\mathbf{X}}\) along \(\tilde{\mathbf{v}}_i\)
However, for ordinary PCA, its PC’s are hard to interpret since each PC is a linear combination of all feature variables via a loading vector (i.e., coefficients of the linear combination are entries of a loading vector) and often a loading vector has many nonzero entries
Namely, for ordinary PCA, it is hard to identify a few features in a direction that accounts for most of the variability of data along that direction
The optimization problem in the previous slide is called an “elastic net” regularization: it combines a lasso (\(\ell_1\)) penalty, which induces sparsity in the loadings, with a ridge (\(\ell_2\)) penalty
Caution: unlike ordinary PCA, the score vectors of SPCA are not necessarily uncorrelated
The pitprops data set is a correlation matrix computed from \(n=180\) observations on \(p=13\) features
Overall a sparse PC explains a bit less variability than an ordinary PC but provides greater interpretability
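For reference, a hedged sketch of running SPCA on pitprops, assuming the elasticnet package (Zou and Hastie) is installed; spca() with type = "Gram" accepts a correlation matrix directly, and the para values follow the package's documented example:

```r
# Sketch (assumes the 'elasticnet' package is installed): SPCA on the
# pitprops correlation matrix; sparse = "varnum" with these para values
# requests 7, 4, 4, 1, 1, 1 nonzero loadings for the 6 sparse PCs.
library(elasticnet)
data(pitprops)
out <- spca(pitprops, K = 6, type = "Gram", sparse = "varnum",
            para = c(7, 4, 4, 1, 1, 1))
out$loadings   # sparse loading vectors: many exact zeros
out$pev        # proportion of (adjusted) explained variance per sparse PC
```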

Since the pitprops data is a correlation matrix but not a data matrix, a biplot cannot be created for SPCA in this setting. In general, a biplot cannot be created for PCA or SPCA if only a correlation or covariance matrix is available
When there are many features, a biplot may not be able to show loading vectors very clearly
Human cancer data:
We will pick observations for the 3 cancer types “BREAST”, “OVARIAN”, and “LEUKEMIA”, which gives a data matrix \(\mathbf{X} \in \mathbb{R}^{19 \times 6830}\)
Recall data matrix \(\mathbf{X} \in \mathbb{R}^{19 \times 6830}\), i.e., \(p=6830\) features and \(n=19\) samples:
Histogram of entries of a loading vector; each sparse loading vector has a lot of zero entries

Histogram of entries of a loading vector; each ordinary loading vector has a lot of nonzero entries

18 sparse PCs in total; adjusted cumulative PVE:
[1] 0.04121285 0.06446548 0.08576216 0.10116359 0.11575015
[6] 0.12716561 0.13836691 0.14794329 0.15798456 0.16611261
[11] 0.17404920 0.18198781 0.18927400 0.19637057 0.20227463
[16] 0.20750115 0.21207035 0.21387115


> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12