
Classification is a very common and important task:
Classification is a more complicated task than clustering, since it uses observations whose class labels are known in order to predict the class labels of new observations
Classification methods can be roughly categorized as “model-based” or “non-model-based”.
Some non-model-based methods: k-nearest neighbor (kNN) classifiers, support vector machines, neural networks, tree-based classifiers
Some model-based methods: the Bayes classifier, discriminant analysis, logistic regression
The Bayes classifier
employs a Bayes model that uses prior, conditional, marginal, and posterior probabilities (or distributions)
minimizes the expected 0-1 loss and is hence optimal (to be discussed later)
serves as an unattainable gold standard against which to compare other classifiers (since for real data, we do not know the conditional distributions needed to implement this classifier)
Flip a fair coin; if it lands heads, generate an observation from the bivariate Gaussian distribution with mean vector \((3,0)\) and identity covariance matrix \(\mathbf{I}\), whereas if it lands tails, generate an observation from the bivariate Gaussian distribution with mean vector \((0,-4)\) and identity covariance matrix \(\mathbf{I}\)
Repeat the above 10 times independently in order to obtain 10 observations
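The coin-flip generating process above can be sketched in base R as follows (the seed is arbitrary, so the draws will not match the numbers shown later in these notes):

```r
# Sketch of the data-generating process: a fair coin picks the class,
# then the observation is drawn from the corresponding bivariate Gaussian.
set.seed(1)                                  # arbitrary seed for reproducibility
n <- 10
heads <- rbinom(n, size = 1, prob = 0.5)     # fair coin flips
y <- ifelse(heads == 1, 1, 2)                # class labels
mu <- rbind(c(3, 0), c(0, -4))               # row j = mean vector of Class j
# identity covariance: the two coordinates are independent standard normals
x <- mu[y, ] + matrix(rnorm(2 * n), nrow = n, ncol = 2)
```

Because the covariance matrices are identity matrices, no multivariate sampler is needed; adding independent standard normals to the chosen mean vector suffices.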
Namely, the 10 observations form 2 distinct groups according to their mean vectors, and there are 2 features.
10 observations form 2 classes (observations 1, 4, 7 and 9 form a class, and the rest form another class):

The data generating process can be formulated as follows:
There are 2 classes, i.e., \(Y_i =1\) for “Class 1” or \(Y_i=2\) for “Class 2”, such that, for some \(p \in (0,1)\) \[ \Pr(Y_i =1)=p, \quad \Pr(Y_i =2)=1-p \]
If \(Y_i =1\), then \(x_i \sim \text{Gaussian}(\mu_1,\mathbf{\Sigma}_1)\), whereas if \(Y_i =2\), then \(x_i \sim \text{Gaussian}(\mu_2,\mathbf{\Sigma}_2)\)
Note: in the example, \(n=10\), \(p=0.5\), \(\mu_1=(3,0)\), \(\mu_2=(0,-4)\), \(\mathbf{\Sigma}_1=\mathbf{\Sigma}_2=\mathbf{I}\)
\(\Pr(Y_i =1)=p\) and \(\Pr(Y_i =2)=1-p\) are called the prior probabilities for class labels
“If \(Y_i =1\), then \(x_i \sim \text{Gaussian}(\mu_1,\mathbf{\Sigma}_1)\)” means “the conditional density \(f_1\) of \(x_i\) given \(Y_i=1\) is the Gaussian density with mean vector \(\mu_1\) and covariance matrix \(\mathbf{\Sigma}_1\)”, and is written as \[ f_1(x | Y_i=1) \sim \text{Gaussian}(\mu_1,\mathbf{\Sigma}_1) \]
Similarly, “if \(Y_i =2\), then \(x_i \sim \text{Gaussian}(\mu_2,\mathbf{\Sigma}_2)\)” means “the conditional density of \(x_i\) given \(Y_i=2\), denoted by \(f_2(x | Y_i=2)\) , is the Gaussian density with mean vector \(\mu_2\) and covariance matrix \(\mathbf{\Sigma}_2\)”
Given prior probabilities \[\Pr(Y_i =1)=p, \quad \Pr(Y_i =2)=1-p\] for the 2 classes and the conditional densities for \(x_i\) as \[f_j(x | Y_i=j) \sim \text{Gaussian}(\mu_j,\mathbf{\Sigma}_j),j=1,2\]
the marginal density of \(x_i\) is \[ f(x) = f_1(x | Y_i=1) \Pr(Y_i =1) + f_2(x | Y_i=2) \Pr(Y_i =2) \]
the posterior probability of \(x_i\) belonging to Class \(j\), i.e., the conditional probability of \(Y_i=j\) given \(x_i\), is \[ \Pr(Y_i=j|x_i) = \frac{f_j(x_i | Y_i=j) \Pr(Y_i =j)}{f(x_i)} \]
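These quantities can be computed directly in base R. A sketch for the example's parameters (\(p=0.5\), identity covariances), using the fact that with identity covariance the bivariate Gaussian density factors into a product of two univariate normal densities:

```r
# Conditional, marginal, and posterior quantities for the first observation.
x1  <- c(3.132420, -2.311069)
mu1 <- c(3, 0); mu2 <- c(0, -4)
# identity covariance => bivariate density = product of univariate dnorm terms
f1 <- prod(dnorm(x1, mean = mu1))     # conditional density given Class 1
f2 <- prod(dnorm(x1, mean = mu2))     # conditional density given Class 2
marg  <- 0.5 * f1 + 0.5 * f2          # marginal density f(x1)
post1 <- 0.5 * f1 / marg              # Pr(Y_1 = 1 | x1), about 0.9747
post2 <- 0.5 * f2 / marg              # Pr(Y_1 = 2 | x1)
```

The two posteriors sum to 1, since the same marginal density appears in both denominators.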
For the example,
For \(x_1=(3.132420,-2.311069)\), we have
The posteriors are:
\[ \begin{aligned} \Pr(Y_1=1|x_1) & = \frac{f_1(x_1 | Y_1=1) \Pr(Y_1 =1)}{f(x_1)} \\ & = \frac{0.5 \times 0.01091}{0.00560}=0.9747 \end{aligned} \]
\[ \begin{aligned} \Pr(Y_1=2|x_1) & = \frac{f_2(x_1 | Y_1=2) \Pr(Y_1 =2)}{f(x_1)} \\ & = \frac{0.5 \times 0.00028}{0.00560}=0.025259 \end{aligned} \]
The Bayes classifier assigns an observation to the class whose posterior probability, given the observation, is the largest among the posterior probabilities computed for all classes.
Namely, \(x_i\) will be classified to Class \(\hat{j}\) if \[ \hat{j} = \operatorname*{argmax}_{j\in \{1,2\}} \Pr(Y_i=j|x_i) \]
Note: \(\operatorname*{argmax}\) returns the argument at which the maximum being sought is attained
For the example,
for observation \(x_1=(3.132420,-2.311069)\), we have \[\Pr(Y_1=1|x_1)=0.9747, \quad \Pr(Y_1=2|x_1)=0.025259\]
Since \[\Pr(Y_1=1|x_1) > \Pr(Y_1=2|x_1),\] the Bayes classifier will classify \(x_1\) to Class \(1\)
Since there are 2 classes and \(\Pr(Y_1=1|x_1) + \Pr(Y_1=2|x_1)=1\), we have \(\Pr(Y_1=1|x_1) > \Pr(Y_1=2|x_1)\) if and only if (“iff”) \(\Pr(Y_1=1|x_1) > 0.5\). So, in the two-class case, the Bayes classifier classifies \(x_1\) to Class \(j\) if \(\Pr(Y_1=j|x_1) > 0.5\)
Posterior probability of each observation belonging to each class (PPCL1: posterior for Class 1; PPCL2: posterior for Class 2):
PPCL1 PPCL2
[1,] 9.747409e-01 2.525912e-02
[2,] 1.046019e-03 9.989540e-01
[3,] 2.095184e-06 9.999979e-01
[4,] 1.000000e+00 1.683903e-10
[5,] 1.384877e-05 9.999862e-01
[6,] 5.296168e-02 9.470383e-01
[7,] 9.999762e-01 2.380586e-05
[8,] 6.620328e-04 9.993380e-01
[9,] 1.000000e+00 3.390852e-08
[10,] 7.972362e-04 9.992028e-01
True class labels (TrueClass) and labels (EstClass) obtained by the Bayes classifier:
PPCL1 PPCL2 TrueClass EstClass
1 9.747409e-01 2.525912e-02 1 1
2 1.046019e-03 9.989540e-01 2 2
3 2.095184e-06 9.999979e-01 2 2
4 1.000000e+00 1.683903e-10 1 1
5 1.384877e-05 9.999862e-01 2 2
6 5.296168e-02 9.470383e-01 2 2
7 9.999762e-01 2.380586e-05 1 1
8 6.620328e-04 9.993380e-01 2 2
9 1.000000e+00 3.390852e-08 1 1
10 7.972362e-04 9.992028e-01 2 2
Classification errors are zero, as the cross-tabulation of true (rows) versus estimated (columns) class labels shows:
1 2
1 4 0
2 0 6
To implement the Bayes classifier, we need to be able to obtain the marginal and posterior probabilities.
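As a minimal sketch (not the notes' original code), a two-class Bayes classifier for the example can be written assuming the priors, mean vectors, and identity covariances are known:

```r
# Two-class Bayes classifier with known priors, means, and identity covariances.
# Defaults match the simulated example; all arguments are illustrative.
bayes_classify <- function(x, p = 0.5, mu1 = c(3, 0), mu2 = c(0, -4)) {
  f1 <- prod(dnorm(x, mean = mu1))            # conditional density, Class 1
  f2 <- prod(dnorm(x, mean = mu2))            # conditional density, Class 2
  post1 <- p * f1 / (p * f1 + (1 - p) * f2)   # posterior for Class 1
  if (post1 > 0.5) 1L else 2L                 # two classes: threshold at 0.5
}
```

For example, `bayes_classify(c(3.132420, -2.311069))` returns `1`, matching the classification of \(x_1\) above.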

Assume there are \(p\) features, and let \(x_{i}\), \(i=1,\ldots,N\) be the \(i\)th observation for these features
Let \(G_{i}\) be the class label for \(x_{i}\) for each \(i=1,\ldots,N\)
The \(N\) observations \(x_i\) form 2 distinct classes, i.e., \(G_{i}\in \mathcal{G}=\{1,2\}\) for all \(i\), and \(G_{i}\)’s are known; here \(\mathcal{G}\) is the set of distinct class labels
The task: given a new observation \(x\) (from \(X\)) for the \(p\) features, estimate its class label \(G\), i.e., classify \(x\) into one of the 2 classes
The observations on the features, i.e., \(x_i\)’s, are often regarded as a sample from a random vector \(X\)
In Bayesian statistics, the class labels, i.e., the \(G_i\)’s, are regarded as a sample from a random variable \(G\), reflecting the assumption that for each observation \(x'\) from \(X\), its class label \(G'\) is random
The performance of a classifier is measured by how well it classifies new observations on average, accounting for the randomness of \(X\) and \(G\) when new observations are obtained
The “0-1 loss” as a measure on classification performance in a two-class classification task:

Let \(\hat{G}(x)\) be the estimated class label for an observation \(x\) from \(X\), i.e., \(\hat{G}(x)=1\) or \(2\)
For \(k,l \in \{1,2\}\), the 0-1 loss \(L\) is defined as: \[L(k,l)=0 \text{ if } k=l \text{ and } L(k,l) =1 \text{ if } k \ne l \]
Namely, let \(G(x)\) be the true class label for \(x\), then \(L(G(x),\hat{G}(x))=0\) when \(G(x)=\hat{G}(x)\) but \(L(G(x),\hat{G}(x))=1\) when \(G(x) \ne \hat{G}(x)\)
Namely, the loss \(L\) is \(0\) if and only if an observation is correctly classified; otherwise \(L\) is \(1\)
Let \(G(X)\) denote the true class label of \(X\). For each \(k \in \{1,2\}\), let \(\mathcal{G}_{k}\) denote Class \(k\); that is, \(G(x)=\mathcal{G}_{k}\) exactly when the observation \(x\) belongs to Class \(k\)
Then the expected 0-1 loss is \[ \begin{aligned} \rho & =E(L[G(X),\hat{G}(X)]) \\ & = E_{X}\left( \sum_{k=1}^{2}L[\mathcal{G}_{k},\hat {G}(X)]\Pr(\mathcal{G}_{k}|X)\right), \end{aligned} \] where \(E_{X}\) denotes expectation with respect to the distribution of \(X\)
Write the inner (pointwise) expected loss as \(\rho(x; \hat{G}) = \sum_{k=1}^{2}L[\mathcal{G}_{k},\hat{G}(x)]\Pr(\mathcal{G}_{k}|x)\)
Let \(q_k(x)=\Pr(\mathcal{G}_{k}|x)\) for \(k \in \{1,2\}\) and set \[q(x)=\max\{q_1(x), q_2(x)\}\]
If \(q(x)=q_1(x)\) and we set \(\hat{G}(x)=1\), then \[ L[\mathcal{G}_{1},\hat{G}(x)]=0, \quad L[\mathcal{G}_{2},\hat{G}(x)]=1 \] \(\rho(x; \hat{G}) = 0 \times q_1(x) + 1 \times q_2(x) = q_2(x)=\min\{q_1(x),q_2(x)\}\)
Similarly, if \(q(x)=q_2(x)\) and we set \(\hat{G}(x)=2\), then \(\rho(x; \hat{G}) = 1 \times q_1(x) + 0 \times q_2(x) = q_1(x)=\min\{q_1(x),q_2(x)\}\)
Since \(q_k(x)=\Pr(\mathcal{G}_{k}|x)\) and \(q_1(x)+q_2(x)=1\), in either case discussed previously, we must have \[ \begin{aligned} \rho(x; \hat{G}) & = \min\{\Pr(\mathcal{G}_{1}|x), \Pr(\mathcal{G}_{2}|x)\}\\ & = 1 - \max\{\Pr(\mathcal{G}_{1}|x), \Pr(\mathcal{G}_{2}|x)\} \end{aligned} \] when the classifier \(\hat{G}\) assigns an observation \(x\) from \(X\) to Class \(j\), where \[\Pr(\mathcal{G}_{j}|x)=\max\{\Pr(\mathcal{G}_{k}|x): k=1,2\}\]
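The case analysis above can be checked numerically. The sketch below uses the posteriors for \(x_1\) computed earlier in these notes:

```r
# Pointwise expected 0-1 loss for the two possible decisions at x1.
loss01 <- function(k, l) as.integer(k != l)   # the 0-1 loss
q <- c(0.9747409, 0.0252591)                  # q_k(x1) = Pr(G_k | x1), from the text
rho <- function(ghat) sum(loss01(1:2, ghat) * q)
# rho(1) = q[2] = min(q) and rho(2) = q[1], so deciding Class 1 minimizes the loss
```

Here `rho(1)` equals \(1-\max_k q_k(x_1)\), in agreement with the display above.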
Thus, \(\rho(x; \hat{G})\) is minimized by the classifier given above, i.e., \[ \hat{G}(x) = \operatorname*{argmax}_{g\in \{1,2\}} \Pr\left(g|X=x\right) \]
The expected 0-1 loss \[ \rho =E(L[G(X),\hat{G}(X)]) \] is minimized by the classifier \[ \hat{G}(x) = \operatorname*{argmax}_{g\in \{1,2\}} \Pr\left(g|X=x\right) \]
The Bayes classifier assigns an observation \(x\) from \(X\) to class \(j\), for which \(\Pr(G=j|X=x)\) is the largest among all \(\Pr(G=k|X=x)\)’s for \(k \in \mathcal{G}=\{1,2\}\)
Assume there are \(p\) features, and let \(x_{i}\), \(i=1,\ldots,N\) be the \(i\)th observation for these features
Let \(G_{i}\) be the class label for \(x_{i}\) for each \(i=1,\ldots,N\)
The \(N\) observations \(x_i\) form \(K\) distinct classes, i.e., \(G_{i}\in\mathcal{G}=\{1,\ldots,K\}\) for all \(i\)
The task: given a new observation \(x\) (from \(X\)) for the \(p\) features, estimate its class label \(G\), i.e., classify \(x\) into one of the \(K\) classes
The observations on the features, i.e., \(x_i\)’s, are often regarded as a sample from a random vector \(X\)
In Bayesian statistics, the class labels, i.e., the \(G_i\)’s, are regarded as a sample from a random variable \(G\), reflecting the assumption that for each observation \(x'\) from \(X\), its class label \(G'\) is random
The performance of a classifier is measured by how well it classifies new observations on average, accounting for the randomness of \(X\) and \(G\) when new observations are obtained
Let \(\hat{G}(x)\) be the estimated class label for an observation \(x\) from \(X\), i.e., \(\hat{G}(x) \in \mathcal{G}=\{1,\ldots,K\}\)
For \(k,l \in \mathcal{G}\), the 0-1 loss \(L\) is defined as \[L(k,l)=0 \text{ if } k=l \text{ and } L(k,l) =1 \text{ if } k \ne l. \]
Namely, the 0-1 loss \(L\) is \(0\) if and only if an observation is correctly classified; otherwise \(L\) is \(1\).
Note: there can be other losses to measure the performance of a classifier
Recall \(\delta =1-\Pr(\mathcal{G}_{g}|X=x)\). Then \(\delta\) is minimized if the estimated class label \(g\) is \[ \hat{G}(x) = \operatorname*{argmax}\nolimits_{k\in\mathcal{G}} \Pr\left(k|X=x\right) \]
Namely, \(\delta\) is minimized if the classifier \(\hat{G}\) assigns \(x\) to Class \(g\), for which \(\Pr\left(\mathcal{G}_{g}|X=x\right)\) is the largest among all \(\Pr\left(\mathcal{G}_k|X=x\right), k \in \mathcal{G}\)
If \(\delta\) is minimized, then so is the expected 0-1 loss \(\rho\). So, the classifier \(\hat{G}\) given above is the minimizer of \(\rho\)
The minimizer \(\hat{G}\) of the expected 0-1 loss \[\rho=E[L(G,\hat{G}(X))]\] and hence the optimal classifier sets \[ \hat{G}(x)=\mathcal{G}_{g}\quad\text{iff}\quad\Pr(\mathcal{G}_{g}|X=x)=\max _{k\in\mathcal{G}}\Pr(k|X=x) \]
Namely, the minimizer \(\hat{G}\) assigns an observation \(x\) from \(X\) to the most plausible Class \(g\), where the “plausibility” of Class \(k\) is measured by \(\Pr(k|X=x)\), the posterior probability of the observation belonging to Class \(k\) given \(X=x\)
The classifier \(\hat{G}\) is called the Bayes classifier
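A sketch of the general \(K\)-class rule, assuming Gaussian class-conditional densities with identity covariances for simplicity (the function name and arguments are illustrative). Note that the marginal density \(f(x)\) is common to all classes and cancels in the argmax, so only the priors times the conditional densities are needed:

```r
# K-class Bayes rule: classify x to the class with the largest posterior.
# priors: vector of K prior probabilities; mus: list of K mean vectors.
bayes_classify_K <- function(x, priors, mus) {
  # identity covariances assumed, so each density is a product of dnorm terms
  f <- sapply(mus, function(mu) prod(dnorm(x, mean = mu)))
  which.max(priors * f)   # f(x) cancels, so unnormalized posteriors suffice
}
```

With \(K=2\), `bayes_classify_K(c(3.132420, -2.311069), c(0.5, 0.5), list(c(3, 0), c(0, -4)))` reproduces the earlier classification of \(x_1\) to Class 1.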
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12