
Note: there are many ways to obtain an estimate \(\hat{\boldsymbol{\beta}}\) of \({\boldsymbol{\beta}}\)
Training error measures the performance of a model on the set of observations from which it was fitted/trained, called the training set. Namely, training error measures how well the fitted/trained model fits/learns the training set.
A commonly used training error is the residual sum of squares (RSS) (though other choices are available)
For example, if we take the set of \(n\) observations as the training set and use RSS as the training error, then the least squares estimate (LSE) minimizes \(\textrm{RSS}=\sum_{i=1}^n e_i^2\)
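The LSE and its training RSS can be sketched as follows. This is an illustrative example, not from the slides (whose own environment is R); the synthetic data and all names are my own.

```python
import numpy as np

# Illustrative sketch: fit a linear model by least squares on a small
# synthetic training set and compute the training RSS.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# The least squares estimate minimizes RSS = sum_i e_i^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
rss = np.sum(residuals ** 2)
print(beta_hat, rss)
```

Any other estimate of \(\boldsymbol{\beta}\) would give an RSS at least as large on this training set.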
Note: other measures of test error are available
Note: the modern wisdom, dubbed the “double descent” curve, was discovered by Belkin and his coauthors

Image credit: Belkin et al; doi.org/10.1073/pnas.1903070116
Cross-validation is a resampling technique to estimate test error, and is often implemented as follows:

With \(n\) observations \(\mathbf{z}_i = (y_i,\mathbf{x}_i)\),
Randomly split the \(n\) observations into a training set \(\mathcal{T}_1\) with \(n_1\) observations, and a validation set \(\mathcal{V}_1\) with \(n_2=n-n_1\) observations
Fit model \(M_l\) using \(\mathcal{T}_1\), apply the fitted model \(\hat{M}_l\) to predict the responses in \(\mathcal{V}_1\), and compute the mean squared error (MSE) \[\textrm{MSE}(\mathcal{V}_1)=n_2^{-1} \sum_{y_i \in \mathcal{V}_1} (y_i -\hat{y}_i)^2,\] where \(\hat{y}_i\) is the fitted value for \(y_i\)
Note: \(\mathcal{T}_1\) and \(\mathcal{V}_1\) are disjoint
Note: the above steps give an estimate of test error of model \(M_l\)
Note: repeating the above steps for each candidate model and picking the one with the smallest estimated test error gives the best model among a set of models
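The validation-set steps above can be sketched as follows (an illustrative Python example with synthetic data; the split sizes and names are my own choices):

```python
import numpy as np

# Illustrative sketch of the validation-set approach: one random split into
# a training set T_1 and a disjoint validation set V_1, then MSE on V_1.
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

idx = rng.permutation(n)
train, val = idx[:70], idx[70:]              # T_1 and V_1 are disjoint
coef = np.polyfit(x[train], y[train], deg=1)  # fit the model M_l on T_1
y_hat = np.polyval(coef, x[val])              # predict responses in V_1
mse = np.mean((y[val] - y_hat) ** 2)          # MSE(V_1) estimates test error
print(mse)
```

Repeating this for each candidate model and comparing the validation MSEs selects a best model.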

The \(k\)-fold CV for model \(M_l\) is implemented as follows: randomly split the \(n\) observations into \(k\) folds of roughly equal size; for each fold \(i=1,\ldots,k\), fit \(M_l\) on the other \(k-1\) folds and compute the MSE on fold \(i\); average the \(k\) MSEs to estimate the test error of \(M_l\)
Settings:
Model: \(Y=\beta_0+\beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon\)
Observations: \((y_i,x_{i1},x_{i2},\ldots,x_{ip}),i=1,\ldots,n\), where \(y_i\) is the \(i\)th observation for \(Y\) and \(x_{ij}\) the \(i\)th observation for \(X_j\)
Estimate \(\hat{\boldsymbol{\beta}}=(\hat{\beta}_0,\hat{\beta}_1,\ldots,\hat{\beta}_p)\) of \({\boldsymbol{\beta}}=({\beta}_0,{\beta}_1,\ldots,{\beta}_p)\)
How to obtain an estimate \(\hat{\boldsymbol{\beta}}\) of \({\boldsymbol{\beta}}\) depends critically on how the number of parameters compares with the sample size:
When the number of parameters is not larger than the sample size:
When the number of parameters is not larger than the sample size but there are many potential predictors, we often desire a small model that is easy to interpret and performs well. Namely, we still need to consider:
Variable or model selection is needed in the classic scenario when there are many potential predictors.
When the number of parameters is larger than the sample size:
In this scenario, we have a few choices:
Consider a linear model with \(p\) predictors: \[ Y=\beta_0+\beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon \]
Best subset selection is a “brute-force” method that checks each of the \(2^p\) possible linear submodels and picks the best one under some criterion
Best subset selection indeed gives the best subset of predictors (among all linear submodels of the \(p\) predictors) under the chosen criterion
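The brute-force search can be sketched as follows (illustrative Python; the scoring criterion here is Mallows' \(C_p\), one of the choices discussed below, and the synthetic data are my own):

```python
import itertools
import numpy as np

# Illustrative sketch: best subset selection over all 2^p linear submodels,
# scored by Mallows' C_p = (RSS + 2 d sigma_hat^2) / n.
rng = np.random.default_rng(2)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)  # only X1, X2 matter

def fit_rss(cols):
    """Training RSS of the least squares fit on the given predictor columns."""
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

# sigma^2 estimated from the full model's residuals
sigma2_hat = fit_rss(range(p)) / (n - p - 1)

# Enumerate all 2^p subsets and keep the one with the smallest C_p
best = min(
    (frozenset(s) for k in range(p + 1) for s in itertools.combinations(range(p), k)),
    key=lambda s: (fit_rss(sorted(s)) + 2 * len(s) * sigma2_hat) / n,
)
print(sorted(best))
```

The \(2^p\) enumeration is why best subset selection is infeasible for even moderately large \(p\), which motivates the stepwise methods below.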
Some criteria for subset/variable selection:


Forward stepwise selection starts with the null model (intercept only) and, at each step, adds the predictor that most improves the fit, yielding a nested sequence of candidate submodels



For best subset, forward stepwise, and backward stepwise selections, we need to select a best model from the best submodels. However, neither training set RSS nor training set \(R^{2}\) can be used for this purpose, since RSS decreases and \(R^{2}\) increases monotonically as predictors are added: both would always favor the largest model
So, a practical way to select a single best model is to balance the training RSS against the model size
Four methods to choose the optimal model:
Mallows’ \(C_p\):
\[\begin{equation} C_{p}=\frac{1}{n}\left( \textrm{RSS}+2d\hat{\sigma}^{2}\right) \end{equation}\]Remark: if \(\hat{\sigma}^{2}\) is an unbiased estimate of \(\sigma^{2}\), then \(C_{p}\) is an unbiased estimate of the test MSE
The intuition behind the adjusted \(R^{2}\) is that “once all of the correct variables have been included in the model, adding additional noise variables will lead to a very small decrease in RSS”
Let \(d\) be the number of predictors in the model, \(n\) the sample size, and \(\hat{\sigma}^{2}\) an estimate of \(\sigma^2=\textrm{Var}(\varepsilon)\):
\(C_{p}=\frac{1}{n}\left( \textrm{RSS}+2d\hat{\sigma}^{2}\right)\)
\(\textrm{AIC}=\frac{1}{n\hat{\sigma}^{2}}\left( \textrm{RSS}+2d\hat{\sigma}^{2}\right)\)
\(\textrm{BIC}=\frac{1}{n\hat{\sigma}^{2}}\left( \textrm{RSS}+d\hat{\sigma}^{2}\log n\right)\)
\(\text{Adjusted }R^{2}=1-\frac{\textrm{RSS}/\left( n-d-1\right) }{\textrm{TSS}/\left(n-1\right) }\)
All formulae for \(C_{p}\), AIC, and BIC above are for a linear model fit by least squares; \(C_p\), AIC, and BIC all have good theoretical justifications
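The four criteria above can be computed directly from one least squares fit, as in this illustrative sketch (synthetic data; \(\hat{\sigma}^{2}\) is taken as the usual unbiased estimate from the fitted model's residuals):

```python
import numpy as np

# Illustrative sketch: evaluate C_p, AIC, BIC, and adjusted R^2 for one
# least squares fit with d predictors and sample size n.
rng = np.random.default_rng(3)
n, d = 60, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, 0.0, -1.0]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
rss = np.sum((y - Xd @ beta) ** 2)
tss = np.sum((y - y.mean()) ** 2)
sigma2 = rss / (n - d - 1)  # unbiased estimate of sigma^2

cp  = (rss + 2 * d * sigma2) / n
aic = (rss + 2 * d * sigma2) / (n * sigma2)
bic = (rss + d * sigma2 * np.log(n)) / (n * sigma2)
adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
print(cp, aic, bic, adj_r2)
```

Note that for \(n > e^2 \approx 7.4\), \(\log n > 2\), so BIC penalizes model size more heavily than AIC and tends to select smaller models.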

Recall CV for model selection:
The \(k\)-fold CV for model \(M_l\) is implemented as follows:
Guideline:

Different random splitting schemes often lead to different optimal models. So,
first, calculate the standard error of the estimated test MSE for each model size, by repeatedly validating “the best model” of that model size;
then select the smallest model whose estimated test error is within one standard error of the lowest point on the curve of estimated MSEs of “the best models” (the one-standard-error rule)
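The guideline above can be sketched as follows (illustrative Python: 5-fold CV over polynomial degrees as a stand-in for model sizes; here the per-fold standard error approximates the repeated-validation standard error described above):

```python
import numpy as np

# Illustrative sketch: k-fold CV across model sizes (polynomial degrees),
# then the one-standard-error rule to pick a single best model.
rng = np.random.default_rng(4)
n, k = 100, 5
x = rng.uniform(-2, 2, size=n)
y = 1 + 2 * x - x ** 2 + rng.normal(scale=0.5, size=n)  # true model is quadratic

folds = np.array_split(rng.permutation(n), k)
degrees = range(1, 6)
cv_mse = {}
for deg in degrees:
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], deg)
        errs.append(np.mean((y[val] - np.polyval(coef, x[val])) ** 2))
    cv_mse[deg] = (np.mean(errs), np.std(errs) / np.sqrt(k))  # mean MSE, SE

best = min(cv_mse, key=lambda d: cv_mse[d][0])       # lowest point on the curve
threshold = cv_mse[best][0] + cv_mse[best][1]        # within one standard error
one_se = min(d for d in degrees if cv_mse[d][0] <= threshold)  # smallest such model
print(best, one_se)
```

By construction the one-standard-error choice is never larger than the minimizer of the CV curve, reflecting the preference for a smaller, more interpretable model.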
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.3 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12