
Dimension reduction
Dimension reduction techniques can be roughly categorized into:
We will cover
The contents are adapted or adopted from:
A linear space \(\mathcal{V}\) over the set \(\mathbb{R}\) of real numbers is a set of “vectors” that is closed under linear combinations, i.e., \[a\mathbf{v}+b\mathbf{w} \in \mathcal{V}\] for all \(\mathbf{v}, \mathbf{w} \in \mathcal{V}\) and \(a,b \in \mathbb{R}\)
\(r\) vectors \(\mathbf{v}_1,\ldots,\mathbf{v}_r \in \mathcal{V}\) are linearly independent if and only if \[a_1 \mathbf{v}_1 + \ldots + a_r \mathbf{v}_r=0\] for \(a_1,\ldots,a_r \in \mathbb{R}\) implies that \(a_i=0\) for all \(i=1,\ldots,r\)
A linear space \(\mathcal{V}\) is said to have dimension \(s >0\), denoted by \(\text{dim}(\mathcal{V})=s\), if and only if there exist \(s\) linearly independent vectors \(\mathbf{v}_1,\ldots, \mathbf{v}_s \in \mathcal{V}\) such that any vector \(\mathbf{w} \in \mathcal{V}\) can be written as a linear combination of these \(s\) vectors, i.e., \[\mathbf{w}=b_1 \mathbf{v}_1 + \ldots + b_s \mathbf{v}_s\] for some \(b_1,\ldots,b_s \in \mathbb{R}\). The \(s\) vectors \(\mathbf{v}_1,\ldots, \mathbf{v}_s\) are called a basis for \(\mathcal{V}\)
If a linear space \(\mathcal{V}\) has dimension \(s\), then any set of \(s\) linearly independent vectors of \(\mathcal{V}\) is a basis for \(\mathcal{V}\)
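As a quick illustration (not from the source), linear independence of a few concrete vectors can be checked numerically by stacking them as columns of a matrix and computing its rank; a minimal NumPy sketch with made-up vectors:

```python
import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = np.array([1.0, 1.0, 0.0])  # v3 = v1 + v2, so {v1, v2, v3} is dependent

M12 = np.column_stack([v1, v2])
M123 = np.column_stack([v1, v2, v3])

# the rank equals the number of linearly independent columns
print(np.linalg.matrix_rank(M12))   # 2: v1, v2 are linearly independent
print(np.linalg.matrix_rank(M123))  # 2 < 3: v1, v2, v3 are dependent
```

The rank of the stacked matrix equals the dimension of the span of the vectors, which matches the number of vectors exactly when they are linearly independent.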
There are many other (not necessarily orthogonal or orthonormal) bases for \(\mathbb{R}^p\)
A Euclidean space has many interesting and agreeable properties and is where most statistical learning tasks happen
Given \(d\) vectors \(\mathbf{w}_1,\ldots, \mathbf{w}_d \in \mathcal{V}\), their linear span (or span for short), denoted by \(\text{span}\left(\{\mathbf{w}_i\}_{i=1}^d\right)\), is the linear subspace of \(\mathcal{V}\) that contains all linear combinations of these \(d\) vectors, i.e., \[ \begin{aligned} \text{span}\left(\{\mathbf{w}_i\}_{i=1}^d\right)= \{\mathbf{w} \in \mathcal{V}: \mathbf{w}=b_1 \mathbf{w}_1 + \ldots + b_d \mathbf{w}_d \\ \text{for some } b_1, \ldots,b_d \in \mathbb{R}\} \end{aligned} \]
A vector space \(\mathcal{V}\) is always the linear span of any of its bases
Two bases \(\{\mathbf{e}_1,\mathbf{e}_2\}\) and \(\{\mathbf{b}_1,\mathbf{b}_2\}\) for \(\mathbb{R}^2\), and their linear spans are just \(\mathbb{R}^2\) 
For \(\mathbf{a} \in \mathbb{R}^p\), \(\text{span}\left(\mathbf{a}\right)\) is the line in \(\mathbb{R}^p\) that passes through \(\mathbf{a}\) and the origin \(0\)
For linearly independent \(\mathbf{a}, \mathbf{b}\in \mathbb{R}^3\), \(\text{span}\left(\mathbf{a},\mathbf{b}\right)\) is the plane in \(\mathbb{R}^3\) that passes through \(\mathbf{a}\), \(\mathbf{b}\) and \(0\)
For linearly dependent \(\mathbf{a}, \mathbf{b}\in \mathbb{R}^p\), \(\text{span}\left(\mathbf{a},\mathbf{b}\right)\) is the line in \(\mathbb{R}^p\) that passes through \(\mathbf{a}\), \(\mathbf{b}\) and \(0\)
Note: the concept of linear space will be used by the 2nd interpretation of principal component analysis, and the concept of orthogonality and basis will be used in the 1st interpretation of principal component analysis
Consider an \(n \times p\) matrix \(\mathbf{A}=(a_{ij})\), a vector \(\mathbf{v} \in \mathbb{R}^p\) and a vector \(\mathbf{z} \in \mathbb{R}^n\). Then
via right action \(\mathbf{A}\) maps \(\mathbf{z}\) to a vector in \(\mathbb{R}^p\), i.e., \[ \mathbf{A}: \mathbb{R}^n \to \mathbb{R}^p \quad \text{such that} \quad \mathbf{z} \mapsto \mathbf{z}^T\mathbf{A} \]
via left action \(\mathbf{A}\) maps \(\mathbf{v}\) to a vector in \(\mathbb{R}^n\), i.e., \[ \mathbf{A}: \mathbb{R}^p \to \mathbb{R}^n \quad \text{such that} \quad \mathbf{v} \mapsto \mathbf{A}\mathbf{v} \]
if \(p=n\), then either left or right action induces a mapping \(\mathbf{A}: \mathbb{R}^p \to \mathbb{R}^p\)
Consider a \(p \times p\) matrix \(\mathbf{A}\) and the mapping \(\mathbf{A}: \mathbb{R}^p \to \mathbb{R}^p\) via left action. If a nonzero vector \(\mathbf{v} \in \mathbb{R}^p\) is such that \[ \mathbf{A}\mathbf{v} = \lambda \mathbf{v} \] for some number \(\lambda\), then \(\lambda\) is called an eigenvalue of \(\mathbf{A}\), \(\mathbf{v}\) is an eigenvector associated with \(\lambda\), and \((\lambda,\mathbf{v})\) an eigenpair for \(\mathbf{A}\)
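To make the definition concrete, here is a minimal NumPy sketch (with a made-up \(2\times 2\) symmetric matrix) that computes an eigenpair and verifies \(\mathbf{A}\mathbf{v}=\lambda\mathbf{v}\):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)   # eigenvalues of a symmetric matrix, ascending
lam, v = vals[-1], vecs[:, -1]   # largest eigenpair

# verify the defining property A v = lambda v
print(np.allclose(A @ v, lam * v))  # True
```

For this matrix the eigenvalues are \(1\) and \(3\); any nonzero scalar multiple of \(\mathbf{v}\) is also an eigenvector for the same \(\lambda\).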
Let \((\lambda,\mathbf{v})\) be an eigenpair for the \(p\times p\) matrix \(\mathbf{A}\), i.e., \[\mathbf{A}\mathbf{v} = \lambda \mathbf{v}\]
Invariance: in summary, the linear subspace spanned by \(\mathbf{v}\) is invariant under the mapping \(\mathbf{A}\) if \(\mathbf{v}\) is an eigenvector of \(\mathbf{A}\)
Consider a \(p\times p\) matrix \(\mathbf{A}\), a vector \(\mathbf{v}\), and \(\tilde{\mathbf{v}}=\mathbf{A}\mathbf{v}\)
i.e., \[ \mathbf{A}\mathbf{v} = \lambda \mathbf{v} \implies \Vert \mathbf{A}\mathbf{v} \Vert = \vert\lambda\vert \Vert \mathbf{v} \Vert \]
Scaling: in summary, if \((\lambda,\mathbf{v})\) is an eigenpair for \(\mathbf{A}\), then the Euclidean norm (or length) of the image \(\mathbf{A}\mathbf{v}\) of \(\mathbf{v}\) is equal to \(\vert\lambda\vert\) times the Euclidean norm of \(\mathbf{v}\)
Consider a \(p\times p\) matrix \(\mathbf{A}\), a vector \(\mathbf{v}\), and \(\tilde{\mathbf{v}}=\mathbf{A}\mathbf{v}\)
Recall the squared Euclidean norm of \(\tilde{\mathbf{v}}\) is \[ \begin{aligned} \Vert \tilde{\mathbf{v}} \Vert^2 =\langle\tilde{\mathbf{v}},\tilde{\mathbf{v}} \rangle=\tilde{\mathbf{v}}^T\tilde{\mathbf{v}} = (\mathbf{A}\mathbf{v})^T\mathbf{A}\mathbf{v} = \mathbf{v}^T (\mathbf{A}^T\mathbf{A})\mathbf{v} \end{aligned} \]
If \(\mathbf{A}\) is an orthogonal matrix, i.e., \[\mathbf{A}\mathbf{A}^T=\mathbf{A}^T\mathbf{A}=\mathbf{I}_p,\] where \(\mathbf{I}_s\) is the identity matrix of dimension \(s\), then \[ \Vert \tilde{\mathbf{v}} \Vert^2 = \mathbf{v}^T (\mathbf{A}^T\mathbf{A})\mathbf{v} = \mathbf{v}^T \mathbf{I}_p\mathbf{v} = \Vert \mathbf{v} \Vert^2 \]
Length invariance: in summary, the Euclidean norm (or length) of the image \(\mathbf{A}\mathbf{v}\) of \(\mathbf{v}\) is equal to the Euclidean norm of \(\mathbf{v}\) if \(\mathbf{A}\) is an orthogonal matrix
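A hedged numerical check of length invariance, using a \(2\times 2\) rotation matrix (a standard example of an orthogonal matrix) and an arbitrary vector:

```python
import numpy as np

theta = 0.7  # an arbitrary rotation angle
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
v = np.array([3.0, 4.0])  # ||v|| = 5

print(np.allclose(A @ A.T, np.eye(2)))         # A is orthogonal
print(np.isclose(np.linalg.norm(A @ v), 5.0))  # ||Av|| = ||v||
```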
Let \(\mathbf{A}\) be a \(p\times p\) orthogonal matrix, i.e., \(\mathbf{A}\mathbf{A}^T=\mathbf{I}_p\), and let \(\{\mathbf{a}_i\}_{i=1}^p\) be the rows of \(\mathbf{A}\), and \(\{\mathbf{b}_i\}_{i=1}^p\) the columns of \(\mathbf{A}\)
Orthonormal basis: the rows or columns of a \(p\times p\) orthogonal matrix form an orthonormal basis for \(\mathbb{R}^p\)
For an \(n \times p\) matrix \(\mathbf{A}\), consider the mapping \[ \mathbf{A}: \mathbb{R}^p \to \mathbb{R}^n \quad \text{such that} \quad \mathbf{v} \mapsto \mathbf{A}\mathbf{v} \] via left action
Then there exist an orthonormal basis \(\{\mathbf{u}_i\}_{i=1}^n\) for \(\mathbb{R}^n\), an orthonormal basis \(\{\mathbf{v}_i\}_{i=1}^p\) for \(\mathbb{R}^p\), and scalars \(d_1 \ge \ldots \ge d_{\min\{n,p\}} \ge 0\), such that \[ \left\{ \begin{array} {ll} \mathbf{A}\mathbf{v}_i= d_i\mathbf{u}_i & \forall i=1,\ldots,\min\{n,p\} \\ \mathbf{A}\mathbf{v}_i=0 & \forall i> \min\{n,p\} \end{array}\right. \]
The above representation is called the singular value decomposition (SVD) of \(\mathbf{A}\). Each \(\mathbf{u}_i\) is called a left singular vector and each \(\mathbf{v}_i\) a right singular vector associated with the singular value \(d_i\)
Let \(\mathbf{V}=(\mathbf{v}_1,\ldots,\mathbf{v}_p)\), \(\mathbf{U}=(\mathbf{u}_1,\ldots,\mathbf{u}_n)\) and \[ \mathbf{D}=\left( \begin{array} [c]{cccc} d_{1} & 0 & \cdots & 0\\ 0 & \ddots & 0 & \vdots\\ \vdots & & d_{\min\{n,p\}} & 0\\ 0 & \cdots & 0 & 0 \end{array} \right) \in \mathbb{R}^{n \times p} \] Then \(\mathbf{U} \mathbf{U}^T= \mathbf{I}_n\) and \(\mathbf{V} \mathbf{V}^T= \mathbf{I}_p\)
SVD for \(\mathbf{A}\) in matrix form is \[ \mathbf{A} = \mathbf{U} \mathbf{D} \mathbf{V}^T \]
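A small NumPy sketch (random matrix, purely illustrative) verifying the matrix form of the SVD and the ordering of the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

U, d, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD

# reconstruct A from its SVD: A = U D V^T
print(np.allclose(A, U @ np.diag(d) @ Vt))  # True
# singular values are nonnegative and decreasingly ordered
print(np.all(d >= 0) and np.all(d[:-1] >= d[1:]))  # True
```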
Note: SVD is very fundamental!
For a \(p \times p\) symmetric matrix \(\mathbf{A}=(a_{ij})\), i.e., \(\mathbf{A}^T=\mathbf{A}\), we have the spectral decomposition theorem:
Namely, there exist an orthonormal basis \(\{\mathbf{v}_i\}_{i=1}^p\) for \(\mathbb{R}^p\) and scalars \(\{d_i\}_{i=1}^p\) such that \(\mathbf{A}\mathbf{v}_i = d_i \mathbf{v}_i\)
Note: spectral decomposition is very fundamental!
A \(p \times p\) symmetric matrix \(\mathbf{A}\) is said to be positive semi-definite, denoted by \(\mathbf{A} \succeq 0\), if \[ \mathbf{v}^T \mathbf{A} \mathbf{v} \ge 0, \quad \forall \mathbf{v} \in \mathbb{R}^p \] If \(\mathbf{v}^T \mathbf{A} \mathbf{v} > 0\) for all nonzero \(\mathbf{v} \in \mathbb{R}^p\), then \(\mathbf{A}\) is said to be positive definite, denoted by \(\mathbf{A} \succ 0\)
For \(s\) stochastically linearly independent random variables \(X_1,\ldots,X_s\), i.e., \[ a_1 X_1 + \ldots + a_s X_s=0 \text{ with probability } 1 \implies a_1=\ldots=a_s=0, \] their covariance matrix \(\Sigma \succ 0\); otherwise, \(\Sigma \succeq 0\)
Spectral decomposition for a \(p \times p\) symmetric matrix \(\mathbf{A}\) as \[ \mathbf{A}\mathbf{v}_i = d_i \mathbf{v}_i \iff \mathbf{A} = \mathbf{V} \mathbf{D} \mathbf{V}^T \] for an orthogonal matrix \(\mathbf{V}=(\mathbf{v}_1,\ldots,\mathbf{v}_p)\) and a diagonal matrix \(\mathbf{D}=\text{diag}\{d_1,\ldots,d_p\}\) implies:
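The equivalence \(\mathbf{A}\mathbf{v}_i = d_i \mathbf{v}_i \iff \mathbf{A} = \mathbf{V}\mathbf{D}\mathbf{V}^T\) can be checked numerically; a sketch with a made-up symmetric matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])  # symmetric: A^T = A
d, V = np.linalg.eigh(A)         # eigenvalues d and orthogonal V

print(np.allclose(V @ np.diag(d) @ V.T, A))  # A = V D V^T
print(np.allclose(V.T @ V, np.eye(3)))       # columns of V are orthonormal
```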
Note: positive semi-definiteness is very fundamental
Consider a \(p \times p\) symmetric matrix \(\mathbf{A}\), a vector \(\mathbf{v} \in \mathbb{R}^p\), and its image \(\tilde{\mathbf{v}}=\mathbf{A}\mathbf{v} \in \mathbb{R}^p\)
Summary: a symmetric matrix as a mapping has a very intuitive geometric interpretation
Consider an \(n \times p\) matrix \(\mathbf{A}\), a vector \(\mathbf{v} \in \mathbb{R}^p\), and its image \(\tilde{\mathbf{v}}=\mathbf{A}\mathbf{v} \in \mathbb{R}^n\)
For \(\mathbf{B} \in \mathbb{R}^{p \times p}\) and \(\mathbf{B} \succeq 0\),
Note: Write \(d_1=\Vert \mathbf{B}\Vert_{\text{op}}\). Then \(\Vert \mathbf{B}\Vert_{\text{op}}\) for \(\mathbf{B} \succeq 0\) is the maximal scaling of the length of a unit vector
For a \(p \times p\) matrix \(\mathbf{A}=(a_{ij})\), its trace is defined as \[ \text{tr}(\mathbf{A})= \sum\nolimits_{i=1}^p a_{ii} \] Namely, trace is the sum of diagonal entries
For an \(n \times p\) matrix \(\mathbf{A}=(a_{ij})\), its Frobenius norm is defined as \[ \Vert \mathbf{A} \Vert_F = \sqrt{\sum\nolimits_{i=1}^n \sum\nolimits_{j=1}^p a_{ij}^2}=\sqrt{\text{tr}\left(\mathbf{A}^T \mathbf{A}\right)}=\sqrt{\text{tr}\left(\mathbf{A} \mathbf{A}^T\right)} \] Namely, the Frobenius norm of \(\mathbf{A}\) is the Euclidean norm of the vector obtained by stacking rows or columns of \(\mathbf{A}\)
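The identity \(\Vert \mathbf{A} \Vert_F = \sqrt{\text{tr}(\mathbf{A}^T\mathbf{A})} = \sqrt{\text{tr}(\mathbf{A}\mathbf{A}^T)}\) is easy to verify numerically; an illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))

fro = np.sqrt((A ** 2).sum())  # Euclidean norm of the stacked entries

print(np.isclose(fro, np.sqrt(np.trace(A.T @ A))))   # tr(A^T A) form
print(np.isclose(fro, np.sqrt(np.trace(A @ A.T))))   # tr(A A^T) form
print(np.isclose(fro, np.linalg.norm(A, 'fro')))     # built-in Frobenius norm
```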
Note: trace is very fundamental
Spectral decomposition and maximal scaled length are the methodology and theory behind both the population version and data version of principal component analysis (PCA)
Singular value decomposition and maximal scaled length are the methodology and theory behind PCA as the best linear approximation to a data matrix, under the Frobenius norm, among all subspaces of dimension upper bounded by a given positive number
Without sufficient understanding of the key concepts in linear algebra discussed above, one may have considerable difficulty understanding, applying, and interpreting PCA
Note: \(\mathbb{E}\) denotes “expectation” and \(\text{Var}\) “variance”
For \(p\) unit vectors \(\mathbf{v}_i=(\phi_{i1},\ldots,\phi_{ip})^T \in \mathbb{R}^p\) (i.e., \(\Vert \mathbf{v}_i\Vert=1\)), define linear combinations of \(X_j\)’s as \[ Z_i = \underbrace{\mathbf{v}_i^T X=\langle \mathbf{v}_i, X \rangle =\sum_{j=1}^p \phi_{ij}X_j}_{\text{definition of Euclidean inner product}}\in \mathbb{R} \]
Note: since \(\Vert \mathbf{v}_i\Vert=1\), we see \(Z_i=\langle \mathbf{v}_i, X \rangle\) is the projection of \(X\) onto \(\mathbf{v}_i\)
Note: \(\text{Cov}\) denotes “covariance”
Feature vector \(X=(X_1,\ldots,X_p)^T\) with \(\mathbb{E}(X)=0\) and positive definite covariance matrix \(\Sigma \succ 0\)
Projections \[ Z_i =\mathbf{v}_i^T X \in \mathbb{R} \quad \text{with} \quad \Vert \mathbf{v}_i\Vert=1, \] such that \[ \left\{ \begin{array} {ll} \mathbb{E}(Z_i)=0, & \forall i=1,\ldots,p \\ \delta_i=\text{Var}(Z_i)=\mathbf{v}_i^T \Sigma \mathbf{v}_i, &\forall i=1,\ldots,p\\ \delta_{ij}=\text{Cov}(Z_i,Z_j)=\mathbf{v}_i^T \Sigma \mathbf{v}_j, & \forall i\ne j \end{array}\right. \]
PCA aims to find \(p\) unit vectors \(\tilde{\mathbf{v}}_i \in \mathbb{R}^p,i=1,\ldots,p\) such that the projections \[ \tilde{Z}_i =\tilde{\mathbf{v}}_i^T X \in \mathbb{R} \quad \text{with} \quad \Vert \tilde{\mathbf{v}}_i\Vert=1, \] achieve
Let \(\Sigma\) have spectral decomposition as \[ \Sigma \tilde{\mathbf{v}}_i = d_i \tilde{\mathbf{v}}_i \iff \Sigma = \mathbf{V} \mathbf{D} \mathbf{V}^T \] for an orthogonal matrix \(\mathbf{V}=(\tilde{\mathbf{v}}_1,\ldots,\tilde{\mathbf{v}}_p)\) and a diagonal matrix \(\mathbf{D}=\text{diag}\{d_1,\ldots,d_p\}\)
Since \(\delta_i=\text{Var}(Z_i)=\mathbf{v}_i^T \Sigma \mathbf{v}_i\), the maximal scaled length lemma implies that \[ \begin{aligned} d_i =\text{Var}(\tilde{\mathbf{v}}_i^T X)=\max \left\{\text{Var}({Z}_i): {Z}_i={\mathbf{v}}_i^T X, \Vert {\mathbf{v}}_i\Vert=1,\right.\\ \quad \quad \quad \left. \langle \mathbf{v}_i,\mathbf{v}_j\rangle=0, i \ne j \right\} \end{aligned} \]
Namely, successive maximal variance is achieved by setting \[ \tilde{Z}_i=\tilde{\mathbf{v}}_i^T X, \] whose variance is the \(i\)-th largest eigenvalue \(d_i\) of \(\Sigma\), where \(\tilde{\mathbf{v}}_i\) is a unit eigenvector associated with \(d_i\)
Since \(\tilde{\mathbf{v}}_i^T \tilde{\mathbf{v}}_j=0\) for \(i\ne j\), the covariance \[ \text{Cov}(\tilde{Z}_i,\tilde{Z}_j)=\tilde{\mathbf{v}}_i^T \Sigma \tilde{\mathbf{v}}_j=\tilde{\mathbf{v}}_i^T d_j \tilde{\mathbf{v}}_j=0. \] Namely, \(\tilde{Z}_i\)’s are mutually uncorrelated
Each such \(\tilde{Z}_i\) is called a principal component, and \(\tilde{\mathbf{v}}_i\) the loading vector for \(\tilde{Z}_i\)
Summary: PCA projects feature vector \(X\) successively onto unit eigenvectors of the covariance matrix \(\Sigma\) of \(X\) that are associated with decreasingly ordered eigenvalues of \(\Sigma\)
PCA Procedure (population version):
Obtain spectral decomposition \(\Sigma = \mathbf{V} \mathbf{D} \mathbf{V}^T\) for an orthogonal matrix \(\mathbf{V}=({\mathbf{v}}_1,\ldots,{\mathbf{v}}_p)\) and a diagonal matrix \(\mathbf{D}=\text{diag}\{d_1,\ldots,d_p\}\) such that \((d_i,{\mathbf{v}}_i)\) is an eigenpair for \(\Sigma\) and that \(d_1 \ge d_2 \ge \ldots \ge d_p > 0\)
Define optimal projections \(Z_i={\mathbf{v}}_i^T X, i=1,\ldots,p\)
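The two steps of the population procedure can be sketched numerically for a hypothetical covariance matrix \(\Sigma\) (NumPy's `eigh` returns eigenvalues in ascending order, so they are reordered to be decreasing):

```python
import numpy as np

# hypothetical positive definite covariance matrix (made up for illustration)
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Step 1: spectral decomposition with decreasingly ordered eigenvalues
d, V = np.linalg.eigh(Sigma)
order = np.argsort(d)[::-1]
d, V = d[order], V[:, order]

# Step 2: the loading vectors are the columns of V; each projection
# Z_i = v_i^T X has variance v_i^T Sigma v_i = d_i
print(np.allclose(np.diag(V.T @ Sigma @ V), d))  # Var(Z_i) = d_i
print(np.allclose(V.T @ V, np.eye(2)))           # loadings are orthonormal
```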
“PCA for a data matrix” (referred to as the data version of PCA) applies a family of projections to the data matrix in an optimal way:
PCA aims to find \(q \le r=\text{rank}(\mathbf{X})\) projections \[ \tilde{\mathbf{z}}_i =\mathbf{X}\tilde{\mathbf{v}}_i \in \mathbb{R}^n \quad \text{with} \quad \tilde{\mathbf{v}}_i\in \mathbb{R}^p,\Vert \tilde{\mathbf{v}}_i\Vert=1 \] of \(\mathbf{X}=(x_{ij})\) to achieve
Note: general facts are stated above
Setting \(\tilde{\mathbf{z}}_i =\mathbf{X}\tilde{\mathbf{v}}_i\) for \(1 \le i \le r\) gives
Note: \(\Vert \tilde{\mathbf{z}}_i \Vert^2\) is the variability of \(\mathbf{X}\) explained by \(\tilde{\mathbf{z}}_i\)
The \(i\)-th optimal projection \(\tilde{\mathbf{z}}_i =\mathbf{X}\tilde{\mathbf{v}}_i \in \mathbb{R}^n\), where \[ \tilde{\mathbf{v}}_i \in \mathbb{R}^p, \Vert \tilde{\mathbf{v}}_i\Vert=1, \langle \tilde{\mathbf{v}}_i,\tilde{\mathbf{v}}_j \rangle=0,i \ne j, \] is called the \(i\)-th principal component score vector, and its entries are called the \(i\)-th principal component scores
The \(k\)-th entry of \(\tilde{\mathbf{z}}_i\) is \[ \tilde{z}_{ki}=x_k \tilde{\mathbf{v}}_i = \langle x_k^T, \tilde{\mathbf{v}}_i\rangle, 1\le k \le n \] and is the projection of the \(k\)-th observation \(x_k\) (a row vector) onto \(\tilde{\mathbf{v}}_i\)
The unit vector \(\tilde{\mathbf{v}}_i\) is called the loading vector or direction for the \(i\)-th principal component
Given \(n \times p\) data matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) with \(r=\text{rank}(\mathbf{X})\), whose \(i\)-th row is observation \(x_i\) on \(p\) features
Let \(\mathbf{S}=\mathbf{X}^{T}\mathbf{X}\) have spectral decomposition as \[ \mathbf{S}\tilde{\mathbf{v}}_i = d_i \tilde{\mathbf{v}}_i \iff \mathbf{S} = \mathbf{V} \mathbf{D} \mathbf{V}^T \] for an orthogonal matrix \(\mathbf{V}=(\tilde{\mathbf{v}}_1,\ldots,\tilde{\mathbf{v}}_p) \in \mathbb{R}^{p \times p}\) and a diagonal matrix \(\mathbf{D}=\text{diag}\{d_1,\ldots,d_p\}\) such that \[ d_1 \ge \cdots \ge d_{r} >0 \] and \[ d_{r+1} = \cdots = d_{p} = 0 \]
Set \(\tilde{\mathbf{z}}_i =\mathbf{X}\tilde{\mathbf{v}}_i\) for \(1 \le i \le r\)
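The data-version procedure can be sketched end-to-end with simulated data (all numbers are made up): center \(\mathbf{X}\), spectrally decompose \(\mathbf{S}=\mathbf{X}^T\mathbf{X}\), and form the score vectors \(\tilde{\mathbf{z}}_i=\mathbf{X}\tilde{\mathbf{v}}_i\), whose squared lengths recover the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
X = X - X.mean(axis=0)        # column-center the data matrix

S = X.T @ X                   # the matrix to decompose
d, V = np.linalg.eigh(S)
order = np.argsort(d)[::-1]   # reorder eigenvalues decreasingly
d, V = d[order], V[:, order]

Z = X @ V                     # columns are the PC score vectors z_i = X v_i

# ||z_i||^2 = v_i^T S v_i = d_i: variability explained by the i-th PC
print(np.allclose((Z ** 2).sum(axis=0), d))
# score vectors are mutually orthogonal: Z^T Z = D
print(np.allclose(Z.T @ Z, np.diag(d)))
```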
The presentation of the data and population versions of PCA here is conceptually the same as that in Section 10.2 of the Text, except for the following:
Here vector and matrix notations are used, and more statistical and geometrical interpretations of key steps for PCA are provided
Here a systematic treatment of linear dimension reduction based on spectral decomposition is provided
Here the factor \(n^{-1}\) that appears in the optimization settings in Section 10.2 of the Text is omitted, which does not affect the optimal solutions at all
PCA provides the best linear approximation to the observations in a data matrix \(\mathbf{X}\) in Frobenius norm among all subspaces of dimension less than or equal to a given number
This interpretation of PCA can be extended to propose and conduct nonlinear dimension reduction if we do not restrict ourselves to linear subspaces
\(p\) stochastically linearly independent features \(X_1,\ldots,X_p\) and feature vector \(X=(X_1,\ldots,X_p)^T\)
\(n\) linearly independent observations \(x_i \in \mathbb{R}^p\) on \(X\), and data matrix \(\mathbf{X}=(x_{ij}) \in \mathbb{R}^{n \times p}\) whose \(i\)-th row is \(x_i\) (i.e., each column of \(\mathbf{X}\) is a feature and each row of \(\mathbf{X}\) is an observation)
Features are centered, i.e., \(\mathbf{X}\) is column-centered such that \(n^{-1}\sum_{i=1}^n x_{ij}=0\) for all \(1 \le j \le p\)
Consider the rank \(q\) linear model \[ f\left(\boldsymbol{\beta}\right) = \boldsymbol{\mu} +\mathbf{V}_{q} \boldsymbol{\beta} \] for representing \(\left\{ x_{i}\right\}_{i=1}^{n}\), where
We want to find \(\boldsymbol{\beta}_i \in\mathbb{R}^{q}\) such that \(f\left(\boldsymbol{\beta}_i\right)\) reconstructs \(x_{i}\) for each \(i=1,\ldots,n\) in some optimal sense
Data matrix \(\mathbf{X}=(x_{ij}) \in \mathbb{R}^{n \times p}\) has singular value decomposition (SVD) given by:
Recall \(\mathbf{V}=(\mathbf{v}_1,\ldots,\mathbf{v}_p)\) and \(\mathbf{U}=(\mathbf{u}_1,\ldots,\mathbf{u}_n)\). Set \[ \mathbf{D}=\left( \begin{array} [c]{cccc} d_{1} & 0 & \cdots & 0\\ 0 & \ddots & 0 & \vdots\\ \vdots & & d_{\min\{n,p\}} & 0\\ 0 & \cdots & 0 & 0 \end{array} \right) \in \mathbb{R}^{n \times p} \]
Then SVD for \(\mathbf{X}\) in matrix form is \[ \mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T \]
Recall the SVD \(\mathbf{X}=\mathbf{UDV}^{T}\)
The SVD of \(\mathbf{X}\) gives the spectral decomposition of \(\mathbf{X}^{T}\mathbf{X}\) as \[ \mathbf{X}^{T}\mathbf{X}=\mathbf{V}\mathbf{D}^T\mathbf{U}^{T}\mathbf{UDV}^{T}=\mathbf{V}\mathbf{D}^T\mathbf{D}\mathbf{V}^{T} \]
So, each eigenvalue \(\lambda_{i}\) of \(\mathbf{X}^{T}\mathbf{X}\) satisfies \(\lambda_{i}=d_{i}^{2}\), and the corresponding eigenvector of \(\mathbf{X}^{T}\mathbf{X}\) can be identified with the right singular vector \(\mathbf{v}_{i}\)
Note: SVD and spectral decomposition are intrinsically related
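A quick numerical confirmation of the relation \(\lambda_i = d_i^2\) between the eigenvalues of \(\mathbf{X}^T\mathbf{X}\) and the singular values of \(\mathbf{X}\), on a simulated matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))

U, dsv, Vt = np.linalg.svd(X, full_matrices=False)  # singular values, decreasing
lam = np.linalg.eigvalsh(X.T @ X)[::-1]             # eigenvalues, decreasing

print(np.allclose(lam, dsv ** 2))  # lambda_i = d_i^2
```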
The SVD \(\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T\) with \(\mathbf{V}=(\mathbf{v}_1,\ldots,\mathbf{v}_p)\), together with some involved algebraic arguments, implies:
Recall the SVD \(\mathbf{X}=\mathbf{U}\mathbf{D}\mathbf{V}^{T}\)
The optimal subspace is the column space of \(\mathbf{\hat{V}}_{q}\), the \(p\times q\) matrix that contains the first \(q\) columns of \(\mathbf{V}\) (that are associated with the \(q\) largest eigenvalues of \(\mathbf{X}^T\mathbf{X}\))
The optimal \(\hat{\boldsymbol{\beta}}_{i}=\mathbf{\hat{V}}_{q}^{T}x_{i}^T \in \mathbb{R}^q\)
Each column of \(\mathbf{UD}\) is called a “principal component score vector” of \(\mathbf{X}\) since \(\mathbf{X}\mathbf{V}=\mathbf{U}\mathbf{D}\)
The fitted rank \(q\) linear model is \[x_{i}^T\approx\bar{x}^T+\mathbf{\hat{V}}_{q}\mathbf{\hat{V}}_{q}^{T}\left( x_{i}^T-\bar{x}^T\right)\]
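The fitted rank-\(q\) model can be sketched with simulated data; the projection form \(\bar{x}^T+\hat{\mathbf{V}}_{q}\hat{\mathbf{V}}_{q}^{T}(x_{i}^T-\bar{x}^T)\) coincides with the rank-\(q\) SVD truncation of the centered data matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 5))

xbar = X.mean(axis=0)
Xc = X - xbar                          # centered data matrix

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
q = 2
Vq = Vt[:q].T                          # p x q: first q right singular vectors

# fitted rank-q model: project each centered row onto the column space of Vq
Xhat = xbar + Xc @ Vq @ Vq.T
# equivalently, keep only the first q terms of the SVD
Xhat2 = xbar + U[:, :q] @ np.diag(d[:q]) @ Vt[:q]

print(np.allclose(Xhat, Xhat2))  # the two forms agree
```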
Directions are unit eigenvectors; scores are projections

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12