
Nearest neighbor methods
are a class of classification methods and belong to the family of supervised learning methods, in which a classifier is trained on observations with known class labels
classify a new observation into a class using the labels of the training observations whose features are closest to it
can be regarded as estimating the unknown conditional probabilities that are used by the Bayes classifier
The contents on kNN classifiers are adapted or adopted from:
Chapter 2 of the Text “An Introduction to Statistical Learning with Applications in R (Corrected at 8th printing)”
Section 13.3. of the Supplementary Text “The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edition)”
Other sources
Note: the Text uses “KNN” and the Supplementary Text “kNN”; we use “kNN” for “KNN”; we will focus on 2-class problems unless otherwise noted
Training set: \(n\) observations \((x_{1},y_{1}),\ldots,(x_{n},y_{n})\)
The task: For a new observation \((x,y)\) with known \(x\in\mathbb{R}^{p}\) but unknown \(y\in \{0,1\}\), estimate its label \(y\), i.e., classify \(x\) into one of the \(2\) classes
Note: the set \(\mathcal{T}=\{(x_{i},y_{i}): i=1,\ldots,n\}\) is called the training set, and the new observation \((x,y)\notin\mathcal{T}\) is called a test observation
k-nearest-neighbor (kNN) methods:
A kNN method needs a distance measure, a neighborhood size \(k\), and a decision rule (e.g., a threshold on the average label)
So, the choice of each of the above quantities affects kNN classification results
Unless otherwise noted, Euclidean distance will be used to measure how close two observations are, and when there are 2 classes, the threshold on the average label will be \(0.5\)
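The two defaults above can be sketched in code. The following is a minimal Python illustration (the function names are ours, not from the Text) of the Euclidean distance and of constructing the neighborhood \(N_k(x)\):

```python
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors in R^p
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def neighborhood_indices(x, train_x, k):
    # Indices of the k training observations closest to x, i.e., N_k(x)
    order = sorted(range(len(train_x)), key=lambda i: euclidean(x, train_x[i]))
    return order[:k]

train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
print(neighborhood_indices((0.1, 0.1), train_x, 2))  # the two closest points
```

Any other distance (e.g., Manhattan) could be substituted here without changing the rest of the method.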

200 observations form two distinct groups of equal size

“flexibility” here refers to “degree of accommodating local properties of features”

When there are 2 classes, a kNN classifier sets the class label of \(x\) as \[ \hat{Y}(x) = \left\{ \begin{array} {lll} 1 & \text{if} & \bar{Y}(x)=\frac{1}{k}\sum\nolimits_{x_{i}\in N_{k}(x)}y_{i} > 0.5 \\ 0 & \text{if} & \bar{Y}(x)=\frac{1}{k}\sum\nolimits_{x_{i}\in N_{k}(x)}y_{i} \le 0.5 \end{array}\right. \]
When \(k=1\), there is only 1 training observation \(x_{i'} \in N_{1}(x)\) and \(\bar{Y}(x) = y_{i'}\), the label of \(x_{i'}\), i.e., \(\hat{Y}(x)\) is determined solely by \(y_{i'}\), leading to an “overly flexible” decision boundary that is mostly determined by local properties of features
So, if the data set has a complicated class boundary, then choosing a smaller \(k\) is more sensible
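The two-case rule above, with the default threshold \(0.5\), can be sketched as follows in Python (illustrative code with our own naming, not from the Text):

```python
import math

def knn_classify(x, train, k, threshold=0.5):
    # train: list of (x_i, y_i) pairs with y_i in {0, 1}
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # N_k(x): the k training observations closest to x
    nearest = sorted(train, key=lambda obs: dist(x, obs[0]))[:k]
    # Y_bar(x): the average label over N_k(x)
    y_bar = sum(y for _, y in nearest) / k
    # Y_hat(x) = 1 if Y_bar(x) > threshold, else 0
    return 1 if y_bar > threshold else 0

train = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
         ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1)]
print(knn_classify((0.2, 0.2), train, k=3))  # classified with the lower cluster
```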
Recall \(\bar{Y}(x)=\frac{1}{k}\sum\nolimits_{x_{i}\in N_{k}(x)}y_{i}\)
When \(k\) is much larger than \(1\), then \(\bar{Y}(x)\) is the average of considerably many \(y_{j}\)’s, whose corresponding \(x_{j}\)’s in \(N_{k}(x)\) can span a large subset of the feature space
So, when two feature observations \(x\) and \(x'\) are not far apart, their neighborhoods \(N_{k}(x)\) and \(N_{k}(x')\) will contain largely overlapping sets of \(x_{j}\)’s, leading to similar \(\bar{Y}(x)\) and \(\bar{Y}(x')\), similar \(\hat{Y}(x)\) and \(\hat{Y}(x')\), and an “insufficiently flexible” decision boundary that is relatively smooth
10 observations form 2 classes (observations 1, 4, 7 and 9 form a class, and the rest form another class):

Recall that when there are 2 classes, a kNN classifier sets the class label of \(x\) as \[ \hat{Y}(x) = \left\{ \begin{array} {lll} 1 & \text{if} & \bar{Y}(x)=\frac{1}{k}\sum\nolimits_{x_{i}\in N_{k}(x)}y_{i} > 0.5 \\ 0 & \text{if} & \bar{Y}(x)=\frac{1}{k}\sum\nolimits_{x_{i}\in N_{k}(x)}y_{i} \le 0.5 \end{array}\right. \]
So, it is important (but not easy) to choose a neighborhood size \(k\), not too small and not too large, to obtain good classification performance for a kNN classifier
For an observation \(x\) on the random vector \(X\), let \(Y(x) \in \{0,1\}\) be the true class label of \(x\), and \(\hat{Y}(x) \in \{0,1\}\) the estimated class label for \(x\)
For \(k,l \in \{0,1\}\), the 0-1 loss \(L\) is defined as: \[L(k,l)=0 \text{ if } k=l \text{ and } L(k,l) =1 \text{ if } k \ne l \] Namely, the loss \(L\) is \(0\) if and only if an observation is correctly classified; otherwise \(L\) is \(1\)
The expected 0-1 loss is \(\rho=E(L[Y(X),\hat{Y}(X)]),\) where \(E\) denotes expectation. \(\rho\) is unknown in practice and needs to be estimated
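In practice \(\rho\) is estimated by the average 0-1 loss over a test set, i.e., the test error rate. A Python sketch (our naming, not from the Text):

```python
def error_rate(y_true, y_pred):
    # Average 0-1 loss: the fraction of misclassified observations;
    # this is the empirical estimate of rho = E(L[Y(X), Y_hat(X)])
    losses = [0 if y == yhat else 1 for y, yhat in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

print(error_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # 1 mistake in 4 -> 0.25
```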
Consider a classifier \(\hat{Y}: x \mapsto \hat{Y}(x) \in \{0,1\}\):
Test errors: 0.1304 (Bayes error), 0.1363 (k=10), 0.1695 (k=1), 0.1925 (k=100):

There seems to be a best value for \(k\):

It is not necessarily true that a smaller \(k\) leads to a larger test error rate for a kNN classifier
The training error rate is zero for the 1-NN classifier
There are values of \(k\) for which the corresponding kNN classifiers have the smallest test error rate among a family of kNN classifiers
The test error rate of a kNN classifier can be close or equal to the Bayes error rate
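The claim that the 1-NN training error rate is zero can be checked directly: when a training observation is classified against the training set itself, its nearest neighbor is itself (at distance \(0\)), so the prediction equals its own label whenever the \(x_i\)'s are distinct. A Python sketch (illustrative):

```python
import math

def one_nn_predict(x, train):
    # 1-NN: return the label of the single closest training observation
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    _, y = min(train, key=lambda obs: dist(x, obs[0]))
    return y

train = [((0.0, 0.0), 0), ((1.0, 2.0), 1), ((2.0, 0.5), 0), ((3.0, 3.0), 1)]
# Each training x_i is its own nearest neighbor, so every prediction matches y_i
train_error = sum(one_nn_predict(xi, train) != yi
                  for xi, yi in train) / len(train)
print(train_error)  # 0.0
```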
\(k=7\) appears to be optimal for minimizing the test error but not for minimizing the cross-validated test error:

For a \(2\)-class problem and class label \(j \in \{0,1\}\),
logistic regression estimates, and the Bayes classifier uses, the conditional probability \[f_j(x)=\Pr(Y=j|X=x)\] and both assign \(x\) to class \(g\) if \[ \Pr(Y=g|X=x)>0.5 \]
kNN classifier computes \[\hat{g}_j(x)=k^{-1}\sum\nolimits_{x_{i}\in N_{k}(x)}1\{y_{i}=j\}\]
when \(k\) is large and \(k/n\) is small,
\[\hat{g}_j(x) \approx \Pr(Y=j|X=x)\] and the 3 classifiers are similar
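The quantity \(\hat{g}_j(x)\) is simply the proportion of class-\(j\) labels among the \(k\) nearest neighbors; with 2 classes, \(\hat{g}_1(x)\) coincides with \(\bar{Y}(x)\). A Python sketch (our naming):

```python
import math

def class_proportions(x, train, k, classes):
    # g_hat_j(x) = (1/k) * sum over N_k(x) of 1{y_i = j}, for each class j
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda obs: dist(x, obs[0]))[:k]
    return {j: sum(1 for _, y in nearest if y == j) / k for j in classes}

train = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 1), ((5.0, 5.0), 1)]
props = class_proportions((0.2, 0.2), train, k=3, classes=[0, 1])
print(props)  # proportions sum to 1; with 2 classes, props[1] equals Y_bar(x)
```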
For a \(K\)-class classification task and \(j \in \mathcal{G}=\{1,\ldots,K\}\), the kNN classifier computes \(\hat{g}_j(x)\) for each class \(j\) and assigns \(x\) to the class with the largest \(\hat{g}_j(x)\), i.e., by a majority vote among the \(k\) nearest neighbors
Note: \(1\{y_{i}=j\}=1\) if \(y_i=j\) and \(0\) otherwise
The choice of \(k\) can greatly affect the decision boundary and hence classification error of a kNN classifier
For each data generating process, there seems to be an optimal \(k\) for which the test error of this kNN classifier is minimal among a family of kNN classifiers
In practice, the choice of \(k\) can be determined by the cross-validation method. However, this method may not provide a good choice of \(k\) when there is an insufficient number of observations to use
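A leave-one-out cross-validation sketch for choosing \(k\) (illustrative Python with our own naming; the Text's labs use R):

```python
import math

def knn_predict(x, train, k):
    # kNN with Euclidean distance and threshold 0.5 on the average label
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(train, key=lambda obs: dist(x, obs[0]))[:k]
    return 1 if sum(y for _, y in nearest) / k > 0.5 else 0

def loocv_error(train, k):
    # Leave-one-out CV: classify each observation using the other n - 1
    errors = sum(knn_predict(xi, train[:i] + train[i + 1:], k) != yi
                 for i, (xi, yi) in enumerate(train))
    return errors / len(train)

def choose_k(train, candidates):
    # Pick the candidate k with the smallest LOOCV error (ties: smallest k)
    return min(sorted(candidates), key=lambda k: loocv_error(train, k))

train = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 0),
         ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1), ((6.0, 6.0), 1)]
print(choose_k(train, [1, 3, 5]))
```

With so few observations every candidate here achieves zero LOOCV error, which illustrates the caveat above: cross-validation cannot discriminate among values of \(k\) when the sample is too small.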
Consider observations \(\left\{ x_{i}\right\} _{i=1}^{n}\) for the same \(p\) features, and assume the \(x_i\)’s follow the same probability distribution \(P_{X}\)
then \(p\) is the dimensionality of the feature space
Since \(P_{X}\) assigns total mass \(1\) to the feature space, that mass is spread over a region whose volume grows rapidly with \(p\), so observations become increasingly sparse. Hence, as \(p\) increases, it is harder to find \(k\) neighboring points close to a point \(x\) in the feature space
Namely, as \(p\) increases, it is harder to construct a \(k\)-point neighborhood \(N_{k}\left( x\right)\). This is related to the curse of dimensionality
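The sparsity argument can be illustrated numerically: with the number of observations fixed, the distance from a point to its nearest neighbor grows with \(p\). A Python sketch using uniform data on the unit cube (illustrative, not from the Texts):

```python
import math
import random

def mean_nn_distance(n, p, seed=0):
    # Average distance from each of n uniform points in [0, 1]^p
    # to its nearest neighbor among the other n - 1 points
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(p)] for _ in range(n)]
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sum(min(dist(x, y) for j, y in enumerate(pts) if j != i)
               for i, x in enumerate(pts)) / n

# With n = 200 fixed, nearest neighbors drift farther away as p grows
for p in (1, 10, 50):
    print(p, mean_nn_distance(200, p))
```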
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12