
The need for a classifier with a nonlinear decision boundary:

Let us call a “linear classifier” a classifier with a linear decision boundary in the original feature space. One way to build a “nonlinear classifier” is to enlarge the feature space with nonlinear transforms of the features and fit a linear classifier in the enlarged space
This gives a decision boundary that is linear in the enlarged feature space but nonlinear in the original feature space
We can build a support vector classifier (SVC) in an enlarged feature space as follows:
original feature space of dimension \(p\), generated by components of feature vector \(X=(X_1,\ldots,X_p)^T\)
enlarged feature space of dimension \(2p\), generated by components of \(X\) and of \(X^{\cdot 2}=(X_1^2,\ldots,X_p^2)^T\)
let \(x_i\) and \(x_i^{\cdot 2}\) be the \(i\)th observations of \(X\) and \(X^{\cdot 2}\), respectively
let \(y_i\) be the class label for \(x_i\)
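As a minimal sketch of this idea (in Python rather than R, with made-up coefficients), a hyperplane in the enlarged space \((X, X^{\cdot 2})\) can give a circular boundary in the original space:

```python
# Sketch only: enlarge x = (x1, ..., xp) to (x1, ..., xp, x1^2, ..., xp^2).
# A linear rule in the enlarged space with (made-up) beta = (0, 0, 1, 1),
# beta0 = -1 gives f(x) = x1^2 + x2^2 - 1: a circle in the original space.

def enlarge(x):
    """Append the squared components to the feature vector."""
    return list(x) + [v * v for v in x]

def linear_score(z, beta, beta0):
    """Linear function <z, beta> + beta0 in the enlarged space."""
    return sum(b * v for b, v in zip(beta, z)) + beta0

beta, beta0 = [0.0, 0.0, 1.0, 1.0], -1.0
inside = linear_score(enlarge([0.5, 0.5]), beta, beta0)    # negative: inside the unit circle
outside = linear_score(enlarge([1.0, 1.0]), beta, beta0)   # positive: outside the unit circle
```

The score is linear in the four enlarged coordinates, yet the zero set is the nonlinear curve \(x_1^2+x_2^2=1\) in the original two coordinates.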
The original feature space of dimension \(p\), generated by components of feature vector \(X=(X_1,\ldots,X_p)^T\), can be enlarged in different ways
However, enlarging the original feature space may lead to unmanageable computations in the optimization problem that yields the corresponding classifier, if too many transforms of the feature variables are incorporated
We need a sufficiently versatile and computationally efficient way to enlarge a feature space; one such way is the support vector machine (SVM)
Support vector classifier (SVC) uses the Euclidean inner product on the feature space and solves the optimization problem to obtain the optimal hyperplane and its induced classifier
Support vector machine (SVM) replaces the Euclidean inner product on the feature space by a kernel and solves the same type of optimization problem to obtain an optimal hypersurface and its induced classifier, yielding a nonlinear decision boundary in the original feature space when needed
An inner product on \(\mathbb{R}^p \times \mathbb{R}^p\) is a bivariate function \(H\) of two vectors that is, for \(\lambda,\gamma,r,s \in \mathbb{R}\) and \(\mathbf{a},\mathbf{b},\mathbf{c},\mathbf{d} \in \mathbb{R}^p\),
bi-linear, i.e., linearity in each argument, i.e., \[ \begin{aligned} & H(\lambda \mathbf{a} + \gamma \mathbf{c}, r \mathbf{b} + s \mathbf{d}) \\ = & \lambda r H( \mathbf{a}, \mathbf{b}) + \lambda s H( \mathbf{a}, \mathbf{d}) + \gamma r H(\mathbf{c}, \mathbf{b}) + \gamma s H(\mathbf{c}, \mathbf{d}) \end{aligned} \]
symmetric, i.e., \(H(\mathbf{a},\mathbf{b}) = H(\mathbf{b},\mathbf{a})\)
positive definite, i.e., \(H(\mathbf{a},\mathbf{a}) >0\) unless \(\mathbf{a}=\mathbf{0}\)
The Euclidean inner product \(\langle \mathbf{a},\mathbf{b}\rangle = \sum_{i=1}^p a_i b_i\) for \(\mathbf{a}=(a_1,\ldots,a_p)^T,\mathbf{b}=(b_1,\ldots,b_p)^T \in \mathbb{R}^p\) is just one inner product among many inner products
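A quick numerical check (a sketch, not a proof) that the Euclidean inner product satisfies these properties for a few concrete vectors:

```python
def H(a, b):
    """Euclidean inner product <a, b> = sum_i a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))

a, b, c = [1.0, 2.0], [3.0, -1.0], [0.5, 4.0]
lam, r = 2.0, -3.0

# symmetry: H(a, b) = H(b, a)
assert H(a, b) == H(b, a)

# linearity in the first argument (with symmetry, this gives bilinearity)
lhs = H([lam * x + r * y for x, y in zip(a, c)], b)
assert abs(lhs - (lam * H(a, b) + r * H(c, b))) < 1e-12

# positive definiteness: H(a, a) > 0 for a != 0
assert H(a, a) > 0
```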
For a support vector classifier (SVC), its optimal hyperplane \(\hat{S}=\hat{S}(\hat{\alpha},\hat{\beta}_0)\) has \(\hat{\alpha}=\sum_{i=1}^n \hat{a}_i y_i x_i\). Since Euclidean inner product \(\langle \cdot,\cdot\rangle\) is symmetric and bi-linear, we see \[ \langle x, \hat{\alpha} \rangle + \hat{\beta}_0 = \sum_{i=1}^n \hat{a}_i y_i \langle x,x_i \rangle + \hat{\beta}_0 \]
Let \[f(x; \hat{\alpha},\hat{\beta}_0) = \sum_{i=1}^n \hat{a}_i y_i \langle x,x_i \rangle + \hat{\beta}_0\] Then \(\hat{S}\) is the solution in \(x \in \mathbb{R}^p\) to \(f(x; \hat{\alpha},\hat{\beta}_0)=0\), and \(f\) is called a linear representation for the SVC induced by \(\hat{S}\)
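The linear representation can be sketched directly (toy values below are made up, not estimated from data):

```python
def f_linear(x, a_hat, y, X, beta0):
    """Sketch of the SVC decision function f(x) = sum_i a_i y_i <x, x_i> + beta0."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    return sum(ai * yi * dot(x, xi) for ai, yi, xi in zip(a_hat, y, X)) + beta0

# two toy "observations" with hypothetical multipliers a_hat
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, -1.0]
a_hat = [1.0, 1.0]
score = f_linear([2.0, 0.0], a_hat, y, X, beta0=0.0)   # 1*1*2 + 1*(-1)*0 = 2
```

A point is classified by the sign of the score, and \(\hat{S}\) is the set of points with score zero.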
For \(\mathbf{a},\mathbf{b} \in \mathbb{R}^p\), some kernels include:
linear kernel: \(K(\mathbf{a},\mathbf{b}) = \langle \mathbf{a},\mathbf{b}\rangle\)
polynomial kernel of degree \(d\): \(K(\mathbf{a},\mathbf{b}) = (1 + \langle \mathbf{a},\mathbf{b}\rangle)^d\)
radial kernel: \(K(\mathbf{a},\mathbf{b}) = \exp(-\gamma \Vert \mathbf{a}-\mathbf{b} \Vert^2)\) with \(\gamma > 0\)

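Commonly used kernels such as the polynomial and radial kernels are simple to write down; a minimal sketch (function names are my own):

```python
import math

def linear_kernel(a, b):
    """Euclidean inner product, i.e., the kernel used by the SVC."""
    return sum(x * y for x, y in zip(a, b))

def poly_kernel(a, b, d=2):
    """Polynomial kernel of degree d."""
    return (1.0 + linear_kernel(a, b)) ** d

def radial_kernel(a, b, gamma=0.1):
    """Radial (Gaussian) kernel; gamma > 0 controls how local it is."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note that the radial kernel depends only on the distance between \(\mathbf{a}\) and \(\mathbf{b}\), so distant training points contribute little to the fitted value at a test point.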
Figure 12.3 of Supplementary Text:

In essence, an SVM implicitly enlarges the feature space via the use of a kernel
One advantage of using a kernel over explicitly enlarging the feature space is that we need only compute \(K(x_i,x_{i'})\) for all \(\binom{n}{2}\) distinct pairs \((i,i')\), without explicitly working in the enlarged feature space
This is important because in many applications of SVMs, the enlarged feature space can be so large that computations are intractable
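These pairwise kernel values form the \(n \times n\) kernel (Gram) matrix; a sketch of computing it, whose cost depends on \(n\) and \(p\) but not on the dimension of the implicitly enlarged space:

```python
import math

def radial_kernel(a, b, gamma=0.1):
    """Radial kernel; one of the common kernel choices."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def gram_matrix(X, kernel):
    """n x n matrix of K(x_i, x_i'); by symmetry only n(n-1)/2
    off-diagonal evaluations plus the diagonal are needed."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        K[i][i] = kernel(X[i], X[i])
        for j in range(i + 1, n):
            K[i][j] = K[j][i] = kernel(X[i], X[j])
    return K

K = gram_matrix([[0.0], [1.0], [2.0]], radial_kernel)
```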
From Chapter 12 of Supplementary Text, we see that an SVM maximizes the “Lagrange dual function” \[ L_{D} = \sum_{i=1}^n a_{i} - 2^{-1}\sum_{i=1}^n \sum_{i'=1}^n a_{i} a_{i'}y_{i} y_{i'} K( x_i, x_{i'}) \] with constraints \(\sum_{i=1}^n a_i y_i =0\) and \(0 \le a_i \le C\) for all \(i\)
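The dual objective and its constraints are easy to evaluate for a candidate vector of multipliers; a sketch with made-up toy values (this only evaluates \(L_D\), it does not solve the maximization):

```python
def dual_objective(a, y, K):
    """Lagrange dual L_D = sum_i a_i - (1/2) sum_{i,i'} a_i a_i' y_i y_i' K[i][i']."""
    n = len(a)
    s1 = sum(a)
    s2 = sum(a[i] * a[j] * y[i] * y[j] * K[i][j]
             for i in range(n) for j in range(n))
    return s1 - 0.5 * s2

def feasible(a, y, C):
    """Check the constraints sum_i a_i y_i = 0 and 0 <= a_i <= C."""
    return (abs(sum(ai * yi for ai, yi in zip(a, y))) < 1e-12
            and all(0.0 <= ai <= C for ai in a))

K = [[1.0, 0.0], [0.0, 1.0]]      # toy kernel matrix
y = [1.0, -1.0]
a = [0.5, 0.5]
value = dual_objective(a, y, K)   # 1 - 0.5*(0.25 + 0.25) = 0.75
```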
In the optimal solution \[ f_K(x; \hat{\alpha},\hat{\beta}_0) = \sum_{i=1}^n \hat{a}_i y_i K(x,x_i) + \hat{\beta}_0, \] an \(x_i\) such that \(\hat{a}_i \ne 0\) is called a support vector
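A sketch of the kernelized decision function; note that observations with \(\hat{a}_i = 0\) drop out of the sum, so only the support vectors need to be stored (toy values below are made up):

```python
def f_K(x, a_hat, y, X, beta0, kernel):
    """Kernelized decision function; only support vectors (a_i != 0) contribute."""
    return sum(ai * yi * kernel(x, xi)
               for ai, yi, xi in zip(a_hat, y, X) if ai != 0.0) + beta0

def support_vectors(a_hat, X):
    """Observations whose multiplier is nonzero."""
    return [xi for ai, xi in zip(a_hat, X) if ai != 0.0]

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

# toy multipliers: the second observation has a_hat = 0 and is not a support vector
X, y, a_hat = [[1.0, 0.0], [5.0, 5.0]], [1.0, 1.0], [1.0, 0.0]
score = f_K([2.0, 0.0], a_hat, y, X, beta0=0.0, kernel=dot)   # 1*1*<(2,0),(1,0)> = 2
```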
Heart data: 13 features such as Age, Sex and Chol (serum cholesterol in mg/dl), \(303\) observations, and a binary response AHD=Yes or AHD=No; the first three observations:

  Age Sex Chol AHD
1  63   1  233  No
2  67   1  286 Yes
3  67   1  229 Yes
Predict AHD=Yes or AHD=No depending on whether \(\hat{f}(X)<t\) or \(\hat{f}(X) \ge t\). On the training observations, SVC seems to be better than LDA; SVM with the \(\gamma=0.1\) radial kernel seems to give a perfect ROC curve

On the 90 test observations, SVC seems to be better than LDA; SVM with the \(\gamma=0.1\) radial kernel seems to be the worst

When there are \(s>2\) classes, we can do either one-versus-one classification, fitting all \(\binom{s}{2}\) pairwise SVMs and assigning a test observation to the class chosen most often, or one-versus-all classification, fitting \(s\) SVMs that each compare one class to the remaining \(s-1\) classes and assigning the observation to the class whose SVM gives the largest fitted value
Recall the linear representation for support vector classifier (SVC) \[f(x; {\beta}, {\beta_0}) = \langle x, \beta \rangle + {\beta}_0\] with \(\beta=(\beta_1,\ldots,\beta_p)^T\)
The optimization problem that produces SVC is equivalent to \[ \min_{\beta_0 \in \mathbb{R},\beta \in \mathbb{R}^p} \left\{ \overbrace{\sum_{i=1}^n \underbrace{\max[0,1-y_i f(x_i; {\beta}, {\beta_0})]}_{\text{hinge loss}}}^{\text{empirical loss } L(\mathbf{X},\mathbf{y})} + \overbrace{\underbrace{\lambda \Vert \beta \Vert^2}_{\text{ridge penalty}}}^{\text{penalty } P(\beta)} \right\} \]
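The hinge-loss-plus-ridge-penalty objective can be sketched directly (this only evaluates the objective at given coefficients; it does not perform the minimization):

```python
def hinge(t):
    """Hinge loss l(t) = max(0, 1 - t)."""
    return max(0.0, 1.0 - t)

def svc_objective(beta, beta0, X, y, lam):
    """Empirical hinge loss over the sample plus the ridge penalty on beta."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    loss = sum(hinge(yi * (dot(xi, beta) + beta0)) for xi, yi in zip(X, y))
    penalty = lam * sum(b * b for b in beta)
    return loss + penalty

# toy data separated by beta = (1), beta0 = 0: the loss term vanishes,
# leaving only the penalty lam * ||beta||^2
value = svc_objective([1.0], 0.0, [[1.0], [-1.0]], [1.0, -1.0], lam=0.1)
```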

The hinge loss \(l(t)=\max(0,1-t)\) is zero if \(t \ge 1\), i.e., the loss is zero if \[y_i(\langle x_i, \beta \rangle + {\beta}_0) \ge 1,\] i.e., if \(x_i\) is on the correct side of the margin
But the loss function for logistic regression is not exactly zero anywhere, and it is very small for observations that are far from the decision boundary
Hence, logistic regression and SVMs often give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred
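The contrast between the two losses can be seen numerically; a sketch comparing the hinge loss with the logistic (negative log-likelihood) loss, both written in terms of \(t = y f(x)\):

```python
import math

def hinge(t):
    """Hinge loss: exactly zero once t >= 1."""
    return max(0.0, 1.0 - t)

def logistic_loss(t):
    """Logistic regression loss log(1 + exp(-t)): positive everywhere,
    but very small for points far on the correct side of the boundary."""
    return math.log(1.0 + math.exp(-t))

# far on the correct side (large t): hinge is exactly zero,
# logistic loss is small but strictly positive
far_hinge = hinge(5.0)
far_logistic = logistic_loss(5.0)
```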
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12