Stat 437 Lecture Notes 6a

Xiongzhi Chen

Washington State University

Support vector machines (SVMs)

Nonlinear decision boundary

Need of a classifier with nonlinear decision boundary:

Enlarged feature space

Let us call a “linear classifier” a classifier with a linear decision boundary in the original feature space. One way to build a “nonlinear classifier” is

  • to enlarge the original feature space by including nonlinearly transformed feature variables
  • to apply a linear classifier, such as the support vector classifier (SVC) that is defined via the Euclidean inner product, to all feature variables in the enlarged feature space

This will give a decision boundary that is linear in the enlarged feature space but is nonlinear in the original feature space

SVC in enlarged feature space

We can build a support vector classifier (SVC) in an enlarged feature space as follows:

  • original feature space of dimension \(p\), generated by components of feature vector \(X=(X_1,\ldots,X_p)^T\)

  • enlarged feature space of dimension \(2p\), generated by components of \(X\) and of \(X^{\cdot 2}=(X_1^2,\ldots,X_p^2)^T\)

  • let \(x_i\) and \(x^{\cdot 2}{_{i}}\) be the \(i\)th observation for \(X\) and \(X^{\cdot 2}\), respectively

  • let \(y_i\) be the class label for \(x_i\)

SVC in enlarged feature space

  • find optimal hyperplane \(\hat{S}=\hat{S}(\hat{\alpha}_1,\hat{\alpha}_2,\hat{\beta}_0)\) in the enlarged feature space by solving \[ \max\{M: \beta_{0} \in \mathbb{R}, \alpha_1, \alpha_2 \in \mathbb{R}^p, \Vert (\alpha_1^T,\alpha_2^T)^T \Vert =1\} \] subject to \[ \begin{aligned} & \forall i, y_{i}\left( \langle x_{i}, \alpha_1 \rangle + \langle x^{\cdot 2}{_{i}}, \alpha_2 \rangle +\beta_{0}\right) \geq M\left( 1-\epsilon_{i}\right)\\ & \quad \quad \epsilon_{i}\geq 0; \sum\nolimits_{i=1}^n\epsilon_{i}\leq C \text{ for some } C>0 \\ \end{aligned} \]
  • the direction of \(\hat{S}\) is \[\hat{\alpha}=(\alpha_1^T,\alpha_2^T)^T \in \mathbb{R}^{2p}\]
  • the SVC induced by \(\hat{S}\) assigns \((x_0^T,(x^{\cdot 2}{_{0}})^T)^T\) to class \[ \hat{y}_0 = \operatorname{sgn}( \langle x_{0}, \hat{\alpha}_1 \rangle + \langle x^{\cdot 2}{_{0}}, \hat{\alpha}_2 \rangle +\hat{\beta}_{0} ) \]

Enlarged feature space

  • The original feature space of dimension \(p\), generated by components of feature vector \(X=(X_1,\ldots,X_p)^T\), can be enlarged in different ways

  • However, enlarging original feature space may lead to unmanageable computations in the optimization problem that gives the corresponding classifier, if too many transforms of feature variables are incorporated

  • We need a sufficiently versatile and computationally efficient way to enlarge a feature space, and one such is the support vector machine (SVM)

SVM in a nutshell

  • Support vector classifier (SVC) uses the Euclidean inner product on the feature space and solves the optimization problem to obtain the optimal hyperplane and its induced classifier

  • Support vector machine (SVM) replaces the Euclidean inner product on the feature space by a kernel and solves the same type of optimizatin problem to obtain an optimal hypersurface and its induced classifier, yielding a nonlinear decision boundary in the original feature space when needed

Inner product

An inner product on \(\mathbb{R}^p \times \mathbb{R}^p\) is a bivariate function \(H\) of two vectors that is, for \(\lambda,\gamma,r,s \in \mathbb{R}\) and \(\mathbf{a},\mathbf{b},\mathbf{c},\mathbf{d} \in \mathbb{R}^p\),

  • symmetric, i.e., \(H(\mathbf{a},\mathbf{b}) = H(\mathbf{b},\mathbf{a})\)
  • bi-linear, i.e., linearity in each argument, i.e., \[ \begin{aligned} & H(\lambda \mathbf{a} + \gamma \mathbf{c}, r \mathbf{b} + s \mathbf{d}) \\ = & \lambda r H( \mathbf{a}, \mathbf{b}) + \lambda s H( \mathbf{a}, \mathbf{d}) + \gamma r H(\mathbf{c}, \mathbf{b}) + \gamma s H(\mathbf{c}, \mathbf{d}) \end{aligned} \]

  • positive definite, i.e., \(H( \mathbf{a},\mathbf{a}) >0\) unless \(\mathbf{a}=0\)

The Euclidean inner product \(\langle \mathbf{a},\mathbf{b}\rangle = \sum_{i=1}^p a_i b_i\) for \(\mathbf{a}=(a_1,\ldots,a_p)^T,\mathbf{b}=(b_1,\ldots,b_p)^T \in \mathbb{R}^p\) is just one inner product among many inner products

Linear representation of SVC

  • For a support vector classifier (SVC), its optimal hyperplane \(\hat{S}=\hat{S}(\hat{\alpha},\hat{\beta}_0)\) has \(\hat{\alpha}=\sum_{i=1}^n \hat{a}_i y_i x_i\). Since Euclidean inner product \(\langle \cdot,\cdot\rangle\) is symmetric and bi-linear, we see \[ \langle x, \hat{\alpha} \rangle + \hat{\beta}_0 = \sum_{i=1}^n \hat{a}_i y_i \langle x,x_i \rangle + \hat{\beta}_0 \]

  • Let \[f(x; \hat{\alpha},\hat{\beta_0}) = \sum_{i=1}^n \hat{a}_i y_i \langle x,x_i \rangle + \hat{\beta}_0\] Then \(\hat{S}\) is the solution in \(x \in \mathbb{R}^p\) to \(f(x; \hat{\alpha},\hat{\beta_0})=0\), and \(f\) is called a linear representation for the SVC induced by \(\hat{S}\)

Support vector machines

  • If we replace Euclidean inner product \(\langle \cdot,\cdot\rangle\) in the linear representation for SVC as \[f(x; \hat{\alpha},\hat{\beta_0}) = \sum_{i=1}^n \hat{a}_i y_i \langle x,x_i \rangle + \hat{\beta}_0\] by a symmetric, positive definite, bivariate function \(K(\cdot,\cdot)\) called a kernel, then we obtain a generalization \[ f_K(x; \hat{\alpha},\hat{\beta_0}) = \sum_{i=1}^n \hat{a}_i y_i K(x,x_i) + \hat{\beta}_0 \]
  • The resulting classifier that assigns \(x_0\) to class \[ \hat{y}_0 = \operatorname*{sgn}(f_K(x_0; \hat{\alpha},\hat{\beta_0})) \] is called a support vector machine (SVM)

Some kernels

For \(\mathbf{a},\mathbf{b} \in \mathbb{R}^p\), some kernels include:

  • Euclidean inner product \(K(\mathbf{a},\mathbf{b})=\langle \mathbf{a},\mathbf{b}\rangle\), which is called a linear kernel and gives the support vector classifier(SVC)
  • Polynomial kernel of degree \(d \ge 1\) as \[K(\mathbf{a},\mathbf{b})= (1+ \langle \mathbf{a},\mathbf{b}\rangle)^d\] When \(d=1\), the resulting classifier is the SVC
  • Radial kernel as \[ K(\mathbf{a},\mathbf{b})= \exp(-\gamma \Vert \mathbf{a} -\mathbf{b} \Vert^2) \text{ for some } \gamma >0, \] where \(\Vert \mathbf{a} -\mathbf{b} \Vert=d_{\mathbb{R}^p}(\mathbf{a},\mathbf{b})\) is Euclidean distance. For a new observation \(x_0\), \(K(x_0,x_i)\) is small when \(d_{\mathbb{R}^p}(x_0,x_i)\) is large. So, only nearby observations to \(x_0\) affect the predicted class label for \(x_0\)

SVM with different kernels

SVM with different kernels

Figure 12.3 of Supplementary Text:

Advantages of kernels

  • In essence, an SVM implicitly enlarges the feature space via the use of a kernel

  • One advantage of using a kernel over explicitly enlarging feature space is that we need only compute \(K(x_i,x_{i'})\) for all \(\binom{n}{2}\) distinct pairs \((i,i')\), without explicitly working in the enlarged feature space

  • This is important because in many applications of SVMs, the enlarged feature space can be so large that computations are intractable

Optimization for SVM

  • From Chapter 12 of Supplementary Text, we see that an SVM maximizes the “Lagrange dual function” \[ L_{D} = \sum_{i=1}^n a_{i} - 2^{-1}\sum_{i=1}^n \sum_{i'=1}^n a_{i} a_{i'}y_{i} y_{i'} K( x_i, x_{i'}) \] with constraint \(\sum_{i=1}^n a_i y_i =0\) and \(0 \le a_i \le C\)

  • In the optimal solution \[ f_K(x; \hat{\alpha},\hat{\beta_0}) = \sum_{i=1}^n \hat{a}_i y_i K(x,x_i) + \hat{\beta}_0, \] an \(x_i\) such that \(\hat{a}_i \ne 0\) is a support vector

SVM application: data set

  • Heart data: 13 features such as Age, Sex and Chol (serum cholestoral in mg/dl), and \(303\) observations
  • Target: to predict whether an individual has heart disease, i.e., AHD=Yes or AHD=No
  Age Sex Chol AHD
1  63   1  233  No
2  67   1  286 Yes
3  67   1  229 Yes
  • LDA and SVC compute scores of the form \[\hat{f}(X)=\hat{\beta}_0 + \langle X,\hat{\alpha} \rangle\]
  • For any cutoff \(t\), \(X\) is classified into AHD=Yes or AHD=No depending on whether \(\hat{f}(X)<t\) or \(\hat{f}(X) \ge t\)

SVM: ROC curve on training set

SVC seems to be better than LDA; SVM with \(\gamma=0.1\) radial kernel seems to give perfect ROC curve

SVM: ROC curve on test set

90 test observations; SVC seems to be better than LDA; SVM with \(\gamma=0.1\) radial kernel seems to be worst

Support vector machine: miscellaneous

SVM for multiclass settings

When there are \(s>2\) classes, we can do either

  • One-versus-One classification approach: construct \(\binom{s}{2}\) SVMs, each of which compares a pair of classes. A test observation \(x_0\) is assigned to the class to which \(x_0\) is the mostly frequently assigned by the \(\binom{s}{2}\) SVMs
  • One-versus-All classification approach: construct \(s\) SVMs as \[ f_{K,r}(x; \hat{\alpha}_r,\hat{\beta}_{r0}) = \sum_{i=1}^n \hat{a}_{ri} y_i K(x,x_i) + \hat{\beta}_{r0}, r=1,\ldots,s, \] each of which compares one of the \(s\) classes with the remaining \(s-1\) classes. \(x_0\) is assigned to class \[ \tilde{k}=\operatorname{argmax}_{1\le k \le s} f_{K,k}(x_0;\hat{\alpha}_k,\hat{\beta}_{k0}) \]

SVM and hinge loss

  • Recall the linear representation for support vector classifier (SVC) \[f(x; {\beta}, {\beta_0}) = \langle x, \beta \rangle + {\beta}_0\] with \(\beta=(\beta_1,\ldots,\beta_p)^T\)

  • The optimization problem that produces SVC is equivalent to \[ \min_{\beta_0 \in \mathbb{R},\alpha \in \mathbb{R}^p} \left\{ \overbrace{\sum_{i=1}^n \underbrace{\max[0,1-y_i f(x_i; {\beta}, {\beta_0})]}_{\text{hinge loss}}}^{\text{empirical loss } L(\mathbf{X},\mathbf{y})} + \overbrace{\underbrace{\lambda \Vert \beta \Vert^2}_{\text{ridge penalty}}}^{\text{penalty } P(\beta)} \right\} \]

Hinge loss and logistic regression

SVM and logistic regression

  • The hinge loss \(l(t)=\max(0,1-t)\) is zero if \(t \ge 1\), i.e., the loss is zero if \[y_i(\langle x_i, \beta \rangle + {\beta}_0) \ge 1,\] i.e., if \(x_i\) is on the correct side of the margin

  • But the loss function for logistic regression is not exactly zero anywhere, and it is very small for observations that are far from the decision boundary

  • Hence, logistic regression and SVMs often give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred

License and session Information

License

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] knitr_1.21

loaded via a namespace (and not attached):
 [1] compiler_3.5.0  magrittr_1.5    tools_3.5.0    
 [4] htmltools_0.3.6 revealjs_0.9    yaml_2.2.0     
 [7] Rcpp_1.0.0      stringi_1.2.4   rmarkdown_1.11 
[10] stringr_1.3.1   xfun_0.4        digest_0.6.18  
[13] evaluate_0.12