
The contents on neural networks are adapted or adopted from:
We will only discuss “vanilla” neural networks for classification tasks
Settings for \(K\)-class classification task:
Target: classify \(X\) into one of the \(K\) classes
A feedforward vanilla NN with one hidden layer:

Structure of a vanilla feedforward NN with one hidden layer:
Derived features \(Z_{m},m=1,\ldots,M\) are created from linear transforms of feature vector \(X\):
Linearly transformed features: \[\tilde{x}_m=\alpha_{0m}+\alpha_{m}^{T}X,\quad m=1,\ldots,M\] for some parameters \(\alpha_{0m} \in \mathbb{R}\) and \(\alpha_{m} \in \mathbb{R}^p\) and some integer \(M\); each derived feature is then \(Z_m=\sigma(\tilde{x}_m)\) for an activation function \(\sigma\)
\(\sigma\) is often a sigmoid function as \[\sigma(v)=1/(1+e^{-v}), v \in \mathbb{R}\]
A sigmoid (i.e., S-shaped) function with two asymptotes:
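As a quick sketch, the sigmoid and its two asymptotes can be checked numerically in R:

```r
# Sigmoid activation: sigma(v) = 1 / (1 + exp(-v))
sigma <- function(v) 1 / (1 + exp(-v))

sigma(0)     # exactly 0.5, the midpoint of the S-shape
sigma(10)    # near the upper asymptote 1
sigma(-10)   # near the lower asymptote 0
```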

Note: \(T=(T_1,\ldots,T_K)^T\) with \(T_k=\beta_{0k}+\beta_{k}^{T}Z\), so the dimension of \(T\) is equal to \(K\), the number of classes
A vanilla NN is a nonlinear mapping \(f: \mathbb{R}^p \to \mathbb{R}^K\) whose \(k\)th component mapping is \[ \begin{aligned} & f_k(X) = g_k(T) = g_k(\beta_{01}+\beta_{1}^{T}Z,\ldots,\beta_{0K}+\beta_{K}^{T}Z)\\ & = g_k\left(\beta_{01}+\beta_{1}^{T} (\sigma\left( \alpha_{01}+\alpha_{1}^{T}X \right),\ldots,\sigma\left( \alpha_{0M}+\alpha_{M}^{T}X \right))^T,\ldots, \right. \\ & \phantom{{}=1} \left. \beta_{0K}+\beta_{K}^{T} (\sigma\left( \alpha_{01}+\alpha_{1}^{T}X \right),\ldots,\sigma\left( \alpha_{0M}+\alpha_{M}^{T}X \right))^T \right) \end{aligned} \]
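A minimal R sketch of this composite mapping, with made-up dimensions (\(p=4\), \(M=3\), \(K=2\)) and random weights standing in for the \(\alpha\)'s and \(\beta\)'s:

```r
set.seed(1)
p <- 4; M <- 3; K <- 2
sigma   <- function(v) 1 / (1 + exp(-v))     # activation function
softmax <- function(t) exp(t) / sum(exp(t))  # terminal mappings g_k

# random weights standing in for fitted values
alpha0 <- rnorm(M); alpha <- matrix(rnorm(M * p), M, p)
beta0  <- rnorm(K); beta  <- matrix(rnorm(K * M), K, M)

x  <- rnorm(p)                     # a feature vector X
Z  <- sigma(alpha0 + alpha %*% x)  # derived features Z_m, m = 1, ..., M
Tk <- beta0 + beta %*% Z           # T_k = beta_{0k} + beta_k^T Z
f  <- softmax(Tk)                  # f_k(X), k = 1, ..., K
sum(f)                             # the K outputs sum to 1
```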
The more layers an NN has and the more nonlinear functions it uses, the more complicated the mapping from the input vector \(X\) to the output vector \(Y=(Y_1,\ldots,Y_K)^T\) is, and the harder it is to analyze the NN
Note: the above decision rule is similar to the Bayes decision rule
A feedforward vanilla NN with one hidden layer:

The feedforward vanilla NN we have discussed has:
Activation function \(\sigma\), often as the sigmoid function \[\sigma\left(v\right)=1/\left(1+e^{-v}\right)\]
Terminal mappings \(g_k\), often as the softmax function \[g_{k}\left( T\right) = \frac{e^{T_{k}}}{\sum_{l=1}^{K}e^{T_{l}}} \] for a \(K\)-class classification task
Architecture: feedforward with 3 layers (1 input layer, 1 hidden layer, and 1 output layer)
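The softmax terminal mapping can be checked directly in R (the scores for \(T\) below are made up):

```r
# Softmax: g_k(T) = exp(T_k) / sum_l exp(T_l)
softmax <- function(Tvec) exp(Tvec) / sum(exp(Tvec))

g <- softmax(c(2, 1, 0.5))  # K = 3 made-up scores T_1, T_2, T_3
g                           # positive components, largest for T_1
sum(g)                      # exactly 1
```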
In general, the architecture of an NN refers to its number of layers, its connectivity pattern, and its choice of activation functions
Convolutional neural networks (CNNs) are powerful models for image analysis, natural language processing, drug discovery, the game of Go, etc., and they employ convolution operations and the rectifier as activation function
Forward; multilayer; with convolution operators; with rectifier as activation function:

Looped; multilayer; with convolution operators; with rectifier as activation function:

Top 10 CNNs for classification:
Credit: towardsdatascience.com
Unknown parameters of an NN are called “weights” and are stored in a vector \(\theta\)
We seek values of \(\theta\) that make the model fit the data well based on a criterion
For the vanilla NN, \(\theta\) consists of \(M\left(p+1\right)\) weights \[ \alpha_{0m} \in \mathbb{R},\alpha_{m} \in \mathbb{R}^p, m=1,\ldots,M \] and \(K\left( M+1\right)\) weights \[ \beta_{0k} \in \mathbb{R},\beta_{k}\in \mathbb{R}^M, k=1,\ldots,K \]
Note: activation function \(\sigma\) and terminal functions \(g_k\) are prespecified and do not need to be estimated
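The weight count can be tallied with a one-line helper (the dimensions \(p=4\), \(M=5\), \(K=3\) below are made up, matching an iris-like setting):

```r
# Total weights: M*(p+1) alphas plus K*(M+1) betas
nWeights <- function(p, M, K) M * (p + 1) + K * (M + 1)

nWeights(p = 4, M = 5, K = 3)  # 5*5 + 3*6 = 43
```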
Given \(n\) observations \(\left\{ x_{i}\right\} _{i=1}^{n}\) for \(X\) and \(n\) observations \(\left\{ y_{ik}\right\} _{i=1}^{n}\) for each \(Y_{k},k=1,\ldots,K\), the fit of the NN can be measured by the sum of squared errors \[R\left(\theta\right)=\sum_{k=1}^{K}\sum_{i=1}^{n}\left(y_{ik}-f_{k}\left(x_{i}\right)\right)^{2}\] or the cross-entropy \[R\left(\theta\right)=-\sum_{k=1}^{K}\sum_{i=1}^{n}y_{ik}\log f_{k}\left(x_{i}\right)\]
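As a sketch, the two criteria commonly used to fit an NN classifier, the sum of squared errors and the cross-entropy, computed on made-up one-hot labels and predicted probabilities:

```r
# y[i, k]: one-hot indicator that observation i is in class k
# f[i, k]: NN output f_k(x_i), a predicted class probability
y <- rbind(c(1, 0, 0), c(0, 1, 0))              # n = 2, K = 3
f <- rbind(c(0.7, 0.2, 0.1), c(0.1, 0.8, 0.1))

sse <- sum((y - f)^2)    # sum of squared errors
ce  <- -sum(y * log(f))  # cross-entropy
c(sse, ce)
```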
Number of hidden units and layers relates to number of weights and level of complexity of an NN
Minimizing either criterion, usually done via gradient descent, gives a choice of \(\theta\), provided a minimizer \(\hat{\theta}\) exists
However, an NN is often over-parametrized, and we do not seek a global minimizer \(\hat{\theta}\) of \(R\left( \theta\right)\), to potentially avoid overfitting. So, regularization of the weights \(\theta\) is recommended to mitigate overfitting, and/or a validation dataset is used to check for overfitting
When optimizing \(R\left( \theta\right)\), starting values for weights are chosen to be random values near zero
Multiple minima: \(R\left( \theta\right)\) is often nonconvex and possesses many local minima. So, a minimizer \(\hat{\theta}\) may heavily depend on the initial values for the weights. One recommendation is to try a number of random starting configurations for the NN
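The multi-start recommendation can be illustrated on a made-up nonconvex criterion (not an actual NN) using base R's optim:

```r
# R(theta) = (theta^2 - 1)^2 + 0.3*theta has two local minima,
# near theta = 1 (shallower) and near theta = -1 (deeper)
Rcrit <- function(theta) (theta^2 - 1)^2 + 0.3 * theta

set.seed(123)
starts <- rnorm(10, sd = 2)  # several random starting values
fits   <- lapply(starts, function(s) optim(s, Rcrit, method = "BFGS"))
best   <- fits[[which.min(sapply(fits, function(f) f$value))]]
best$par                     # the deeper minimum, near theta = -1
```

Different starts land in different basins; keeping the fit with the smallest criterion value picks out the better local minimum.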
There are many things we do not understand about NNs
We will use the functions neuralnet, compute, and plot from the package neuralnet, which builds simple neural networks. neuralnet{neuralnet} trains neural networks. It allows flexible settings through custom choice of error and activation function. Its basic syntax is:
neuralnet(formula, data, hidden = 1, err.fct = "sse",
          act.fct = "logistic", linear.output = TRUE, ...)
formula: a symbolic description of the model to be fitted
data: a data frame containing the variables specified in formula
hidden: a vector of integers specifying the number of hidden neurons in each hidden layer; the length of the vector is the number of hidden layers
err.fct: a differentiable function used for the calculation of the error. Alternatively, the strings “sse” (for the sum of squared errors) and “ce” (for the cross-entropy) can be used
act.fct: a differentiable function used to obtain the derived features, i.e., the activation function. Additionally, the strings “logistic” and “tanh” are possible for the logistic function and the hyperbolic tangent, respectively. A logistic activation function is a sigmoid function
linear.output: logical. If act.fct should not be applied to the output neurons, set linear.output to TRUE; otherwise, set it to FALSE
neuralnet returns an object of class nn, which is a list containing the following components:
model.list: a list containing the covariates and the response variables extracted from the formula argument
err.fct and act.fct: the error function and the activation function, respectively
net.result: a list containing the overall result of the neural network for every repetition
weights: a list containing the fitted weights of the neural network for every repetition
result.matrix: a matrix containing the reached threshold, needed steps, error, AIC and BIC (if computed), and weights for every repetition; each column represents one repetition
startweights: a list containing the start weights of the neural network for every repetition
compute{neuralnet} provides prediction for a neural network of class nn produced by neuralnet(). Its basic syntax is:
neuralnet::compute(x, covariate, rep = 1)
x: an object of class nn
covariate: a data frame or matrix containing the variables that had been used to train the neural network
rep: an integer indicating the neural network’s repetition which should be used
compute{neuralnet} returns a list containing the following components:
neurons: a list of the neurons’ output for each layer of the neural network
net.result: a matrix with nrow(covariate) rows and K columns, where K is the number of classes; the k-th column contains the predicted probability that an observation belongs to class k
Note: compute{neuralnet} has been replaced by predict.nn{neuralnet}, and the latter only returns the equivalent of net.result
Codes to obtain estimated class labels:
nnpred = neuralnet::compute(x, covariate, rep = 1)
classProbs = nnpred$net.result
# assign observation i the class with the largest predicted probability
LabelEst[i] = which.max(classProbs[i, ])
Observation \(x_i\) is assigned label k if the probability of \(x_i\) belonging to class k is the largest, i.e., \[k =\operatorname{argmax}_{1 \le j \le K} g_j(T),\] where \[f_j(X) = g_j(T) = \frac{e^{T_{j}}}{\sum_{l=1}^{K}e^{T_{l}}},\] \(X\) is the feature vector, \(f_j\) the \(j\)-th component mapping in the NN, and \(T=(T_1,\ldots,T_K)^T\)
Streamlined commands to obtain estimated class labels from an NN:
# fit model
nnfit = neuralnet(formula, data, hidden = 1, err.fct = "sse",
                  act.fct = "logistic", linear.output = FALSE, ...)
# obtain predicted class probabilities
nnpred = neuralnet::compute(nnfit, covariate)
classProbs = nnpred$net.result
# assign each observation the class with the largest probability
LabelEst = rep(1, nrow(covariate))
for (i in 1:nrow(covariate)) {
  LabelEst[i] = which.max(classProbs[i, ])
}
predict.nn{neuralnet} provides prediction for a neural network of class nn produced by neuralnet(). Its basic syntax is:
predict(object, newdata, rep = 1, all.units = FALSE, ...)
object: a neural network of class nn
newdata: new data of class data.frame or matrix
rep: an integer indicating the neural network’s repetition which should be used
all.units: return output for all units instead of final output only
...: further arguments passed to or from other methods
predict.nn{neuralnet} returns a matrix of predictions:
If all.units = TRUE, a list of matrices with output for each unit will be given; otherwise, the returned matrix is the equivalent of net.result returned by compute{neuralnet}
plot.nn{neuralnet} is a method for the generic plot. It is designed for an inspection of the weights of objects of class nn, typically produced by neuralnet. Its basic syntax is:
plot(x, rep = "best", intercept = F, information = F,
     show.weights = F, col.hidden = "blue")
x: an object of class nn
rep: repetition of the neural network. If rep = "best", the repetition with the smallest error will be plotted. If not stated, all repetitions will be plotted, each in a separate window
intercept: a logical value indicating whether to plot the intercept
information: a logical value indicating whether to add the error and steps to the plot
col.hidden: color of the neurons in the hidden layer(s)
Codes for the example are adopted from:
Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width are stored in object iris (in R library ggplot2 or class):
> library(class); dim(iris)
[1] 150 5
> iris[1,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
> unique(iris$Species)
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
Proposal for data splitting and cross-validation (cv):
Randomly split the 150 observations into a training set (40 from each of the 3 classes) and a test set (the remaining 30 observations)
Apply \(5\)-fold cv on the training set to select the optimal NN
Then apply the optimal NN to the test set
> library(class); set.seed(314) # seed needed!!
> trainId=c(sample(1:50,40),sample(51:100,40),
+ sample(101:150,40))
> testId = (1:150)[-trainId]
> trainingSet=iris[trainId,]; testSet=iris[testId,1:4]
> testLabs=iris$Species[testId]
> trainingLabs=iris$Species[trainId]
> m=5
> folds=sample(1:m,nrow(trainingSet),replace=TRUE)
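The fold assignment can be sketched standalone (the seed here is re-set inside the sketch, so these are not necessarily the exact folds used above):

```r
# each of the 120 training rows gets a random fold label in 1..m
set.seed(314)
m <- 5
nTrain <- 120  # 3 classes x 40 training observations
folds  <- sample(1:m, nTrain, replace = TRUE)

table(folds)   # fold sizes are roughly balanced
```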
Codes for the architecture:
neuralnet(Species~., data, hidden = c(5,4),
act.fct = "logistic",linear.output = FALSE)
> library(neuralnet)
> nnTrain=neuralnet(Species~., trainingSet, hidden = c(5,4),
+ act.fct = "logistic",linear.output = FALSE)
> plot(nnTrain,show.weights=F,information=F,intercept=F,
+ rep="best",col.hidden="blue")

\(5\)-fold cv for the NN classifier on the training set:
> library(neuralnet); set.seed(123)
> nnModels = vector("list",m) # save each estimated model
> testErrorCV = double(m)
> for (s in 1:m) {
+ trainingTmp =trainingSet[folds !=s,]
+ testTmp =trainingSet[folds==s,]
+ testLabsTmp =trainingLabs[folds==s]
+ # fit the NN model
+ nnetTmp=neuralnet(Species~., trainingTmp, hidden = c(5,4),
+ act.fct = "logistic",linear.output = FALSE)
+ nnModels[[s]]=nnetTmp
+ ypred = neuralnet::compute(nnetTmp, testTmp)
+ yhat = ypred$net.result
+ # assign labels
+ SpeciesEst=data.frame(
+ "labelEst"=ifelse(max.col(yhat[ ,1:3])==1,"setosa",
+ ifelse(max.col(yhat[ ,1:3])==2,
+ "versicolor", "virginica")))
+ SpeciesEst=factor(SpeciesEst[,1])
+ nOfMissObs= sum(1-as.numeric(testLabsTmp==SpeciesEst))
+ terror=nOfMissObs/length(testLabsTmp) # test error
+ testErrorCV[s]=terror
+ } # end of loop
Cross-validated test error:
> testErrorCV
[1] 0.05263158 0.00000000 0.03448276 0.08000000 0.00000000
> mean(testErrorCV)
[1] 0.03342287
> sd(testErrorCV)
[1] 0.034546
Optimal NN model:
> optNNnumber=min(which(testErrorCV==min(testErrorCV)))
> optNNnumber
[1] 2
> # extract and save optimal NN model
> optNNModel=nnModels[[optNNnumber]]
Apply optimal NN model to test set:
> NNOptPred=neuralnet::compute(nnModels[[optNNnumber]],testSet)
> yhatOPt= NNOptPred$net.result
> SpeciesEstOpt=data.frame(
+ "labelEst"=ifelse(max.col(yhatOPt[ ,1:3])==1,"setosa",
+ ifelse(max.col(yhatOPt[ ,1:3])==2,
+ "versicolor", "virginica")))
> SpeciesEstOpt=factor(SpeciesEstOpt[,1])
> nOfMissObsOPt= sum(1-as.numeric(testLabs==SpeciesEstOpt))
> terrorOPt=nOfMissObsOPt/length(testLabs) # test error
> table(SpeciesEstOpt,testLabs)
testLabs
SpeciesEstOpt setosa versicolor virginica
setosa 10 0 0
versicolor 0 9 0
virginica 0 1 10
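The test error implied by this confusion matrix can be checked by hand (re-entering the table above):

```r
# rows: estimated labels; columns: true labels
# (setosa, versicolor, virginica)
conf <- matrix(c(10, 0,  0,
                  0, 9,  0,
                  0, 1, 10), nrow = 3, byrow = TRUE)

testError <- 1 - sum(diag(conf)) / sum(conf)  # misclassified / total
testError                                     # 1/30, about 0.033
```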
> testSet$Species=testLabs; testSet$EstSpecies=SpeciesEstOpt
> library(ggplot2); ggplot(testSet,aes(Sepal.Width,Petal.Length))+
+ geom_point(aes(shape=EstSpecies,color=Species))+theme_bw()

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12