Stat 435 Lecture Notes 4b

Xiongzhi Chen

Washington State University

Ridge regression: basics

Overview

Ridge regression

  • applies regardless of magnitues of sample size \(n\) and number of predictors \(p\)
  • shrinks estimated coefficients compared to least squares estimate (LSE)
  • is able to produce LSE
  • presents a way to estimate coefficients when \(n < p+1\)

Ridge regression often produces a biased estimate with smaller variance than LSE, and it is a bias-variance trade-off technique.

Settings

  • Model: \(Y=\beta_0+\beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon\)

  • Observations: \((y_i,x_{i1},x_{i2},\ldots,x_{ip}),i=1,\ldots,n\), where \(y_i\) is the \(i\)th observation for \(Y\) and \(x_{ij}\) that for \(X_j\)

  • Estimate \(\hat{\boldsymbol{\beta}}=(\hat{\beta}_1,\ldots,\hat{\beta}_p)\) of \({\boldsymbol{\beta}}=({\beta}_1,\ldots,{\beta}_p)\), and \(\hat{\beta}_0\) of \({\beta}_0\)

  • Fitted model: \(\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \ldots + \hat{\beta}_p x_{ip}\)

  • Residuals: \(e_i=y_i - \hat{y}_i\)

Optimization

  • The ridge estimate \(\hat{\boldsymbol{\beta}}^R_{\lambda}=(\hat{\beta}_1,\ldots,\hat{\beta}_p)\) is the \({\boldsymbol{\beta}}=({\beta}_1,\ldots,{\beta}_p)\) that minimizes \[ L_2(\beta_0,\boldsymbol{\beta},\lambda)= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^p \beta_i^2 \]
  • \(\sqrt{\sum_{i=1}^p \beta_i^2}\) is written as \(\Vert\boldsymbol{\beta}\Vert_2\), i.e., the \(l_2\)-norm of \(\boldsymbol{\beta}\)
  • \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\) is just the RSS
  • \(\lambda \sum_{i=1}^p \beta_i^2\) is called a regularizer or penalty
  • \(L_2(\beta_0,\boldsymbol{\beta},\lambda)\) is referred to as an \(l_2\)-regularized RSS
  • no penalty on \(\beta_0\)

Intercept term

  • Recall \(\mathbf{x}_i=(x_{i1},x_{i2},\ldots,x_{ip}),i=1,\ldots,n\), where \(x_{ij}\) is the \(i\)th observation for \(X_j\)
  • Let \(\mathbf{X}\) be the \(n \times p\) matrix whose \(i\)th row is \(\mathbf{x}_i\)
  • \(\mathbf{X}\) is called a design matrix (when factors as predictors are properly coded)

If the variables \(X_i,i=1,\ldots,p\) are centered, i.e., the columns of \(\mathbf{X}\), to have mean zero before ridge regression is performed, then the estimated intercept will be \[\hat{\beta}_0=\bar{y}=\sum_{i=1}^n y_i/n\]

Solution path

Recall the objective function \[ L_2(\beta_0,\boldsymbol{\beta},\lambda)= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^p \beta_i^2 \] and its solution \(\hat{\boldsymbol{\beta}}^R_{\lambda}=(\hat{\beta}_1,\ldots,\hat{\beta}_p)\)

  • no penalty on estimated \(\beta_0\)
  • \(\lambda=0\) gives the LSE, and \(\lambda=\infty\) forces \(\hat{\boldsymbol{\beta}}^R_{\lambda}=0\)
  • some \(0 < \lambda < \infty\) strikes a balance between LSE and \(\hat{\boldsymbol{\beta}}^R_{\lambda}=0\)

Note: The ridge solution has explicit reprentation.

Bias-variance trade-off

Ridge regression works best in situations where the least squares estimates have high variance:

  • \(\lambda=0\): ridge estimate is the LSE and is unbiased
  • \(\lambda >0\): bias increases and variance decreases usually
  • \(\lambda=\infty\): estimated coefficients are all zero

Bias-variance trade-off

\(p=45\) predictors and \(n=50\) observations; all \(\beta_j \ne 0\)

Select tuning parameter

The optimal value \(\lambda^{\ast}\) of the tuning parameter \(\lambda\) is ofen determined by \(k\)-fold cross-validation:

  • Pick a sequence of \(s\) values for \(\lambda\) as \(\lambda_1, \lambda_2, \ldots, \lambda_s\)
  • For each \(\lambda_l\), apply \(k\)-fold cross-validation to estimate the test error of the corresponding model
  • Set \(\lambda^{\ast}\) as the value among the \(s\) values of \(\lambda\) for which the estimated test error is the smallest

Ridge regression: special case

Special model

  • Assume the design matrix \(\mathbf{X}=\mathbf{I}_p\), i.e., the identity matrix
  • Consider a linear regression model without an intercept, i.e., forcing \(\beta_0=0\)

If \(n=p\) and \(\mathbf{X}=\mathbf{I}_p\), then we have a very special model: \[y_j = \beta_j+\varepsilon_j,j=1,\ldots,p\]

Note: \(\mathbf{X}=\mathbf{I}_p\) is referred to as an orthogonal design

LSE and ridge estimate

For the special model \[y_j = \beta_j+\varepsilon_j,j=1,\ldots,p,\]

  • the LSE (i.e., least squares estimate) is \[\hat{\beta}_j=y_j\]
  • the \(l_2\)-regularized RSS becomes \[ L_2(\beta_{0},\boldsymbol{\beta},\lambda)= \sum_{j=1}^p (y_i - \beta_j)^2 + \lambda \sum_{i=1}^p \beta_i^2 \] and the ridge estimate is \[\hat{\beta}_{j,\lambda}^R=y_j/(1+\lambda)\]

Three estimators

Ridge regression: application

Scale equivariance

  • Recall the design matrix \(\mathbf{X}\), whose \((i,j)\)th entry \(x_{ij}\) is the \(i\)th observation of predictor \(X_j\)
  • Ridge regression estimates depend on the scale of predictors
  • Before applying ridge regression, it is best to standardize predictors as follows \[ \tilde{x}_{ij}= \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij}-\bar{x}_j)^2}} \] where \(\bar{x}_j=n^{-1}\sum_{i=1}^n x_{ij}\)
  • The resulting ridge gression estimates are called standardized estimates

Illustration

Modelling the Credit data set

  • Response Balance
  • Predictors Income, Limit, Rating and Student

Illustration

Inference

  • The method of bias correction can be used to conduct inference on ridge estimates
  • Canonical assumptions involve Gaussian random errors
  • Asymptotic theory on testing coefficients is based on Gaussian limiting distributions
  • P-values from testing can be obtained

Note: “High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi” by Ruben Dezeure, Peter Buhlmann, Lukas Meier and Nicolai Meinshausen

Inference

  • Tesing \(H_{j0}:\beta_j =0\) versus \(H_{j1}:\beta_j \ne 0\)
  • P-values for testing if the coefficient of each of Income, Limit, Rating and Student:
       Income         Limit        Rating    StudentYes 
4.878273e-229  4.712512e-15  2.502320e-18 3.116377e-128 

LASSO: basics

Settings

  • Model: \(Y=\beta_0+\beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon\)

  • Observations: \((y_i,x_{i1},x_{i2},\ldots,x_{ip}),i=1,\ldots,n\), where \(y_i\) is the \(i\)th observation for \(Y\) and \(x_{ij}\) that for \(X_j\)

  • Estimate \(\hat{\boldsymbol{\beta}}=(\hat{\beta}_1,\ldots,\hat{\beta}_p)\) of \({\boldsymbol{\beta}}=({\beta}_1,\ldots,{\beta}_p)\), and \(\hat{\beta}_0\) of \({\beta}_0\)

  • Fitted model: \(\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \ldots + \hat{\beta}_p x_{ip}\)

  • Residuals: \(e_i=y_i - \hat{y}_i\)

Optimization

  • The LASSO estimate \(\hat{\boldsymbol{\beta}}^L_{\lambda}=(\hat{\beta}_1,\ldots,\hat{\beta}_p)\) is the \({\boldsymbol{\beta}}=({\beta}_1,\ldots,{\beta}_p)\) that minimizes \[ L_1(\beta_0,\boldsymbol{\beta},\lambda)= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^p\vert \beta_i \vert \]
  • \(\sum_{i=1}^p \vert\beta_i\vert\) is written as \(\Vert\boldsymbol{\beta}\Vert_1\), i.e., the \(l_1\)-norm of \(\boldsymbol{\beta}\)
  • \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\) is just the RSS
  • \(\lambda \sum_{i=1}^p \vert\beta_i\vert\) is called a regularizer or penalty
  • \(L_1(\beta_0,\boldsymbol{\beta},\lambda)\) is referred to as an \(l_1\)-regularized RSS
  • no penalty on \(\beta_0\)

Solution path

Recall the objective function \[ L_1(\beta_0,\boldsymbol{\beta},\lambda)= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^p \vert \beta_i \vert \] and its solution \(\hat{\boldsymbol{\beta}}^L_{\lambda}=(\hat{\beta}_1,\ldots,\hat{\beta}_p)\)

  • no penalty on estimated \(\beta_0\)
  • \(\lambda=0\) gives the LSE, and \(\lambda=\infty\) forces \(\hat{\boldsymbol{\beta}}^L_{\lambda}=0\)
  • some \(0 < \lambda < \infty\) strikes a balance between LSE and \(\hat{\boldsymbol{\beta}}^L_{\lambda}=0\)
  • some \(\hat{\beta}_j\)’s,\(j=1,\ldots,p,\) can be exactly zero

Bias-variance trade-off

LASSO works best in situations where some coefficeints are extactly zero:

  • \(\lambda=0\): LASSO estimate is the LSE and is unbiased
  • \(\lambda >0\): bias increases and variance decreases usually, and some estimated coefficients are zero
  • \(\lambda=\infty\): estimated coefficients are all zero

Select tuning parameter

The optimal value \(\lambda^{\ast}\) of the tuning parameter \(\lambda\) is ofen determined by \(k\)-fold cross-validation:

  • Pick a sequence of \(s\) values for \(\lambda\) as \(\lambda_1, \lambda_2, \ldots, \lambda_s\)
  • For each \(\lambda_l\), apply \(k\)-fold cross-validation to estimate the test error of the corresponding model
  • Set \(\lambda^{\ast}\) as the value among \(\lambda\) as \(\lambda_1, \lambda_2, \ldots, \lambda_s\) for which the estimated test error is the smallest

LASSO estimate: special case

Special model

  • Recall \(\mathbf{x}_i=(x_{i1},x_{i2},\ldots,x_{ip}),i=1,\ldots,n\), where \(x_{ij}\) is the \(i\)th observation for \(X_j\)
  • Let \(\mathbf{X}\) be the \(n \times p\) matrix whose \(i\)th row is \(\mathbf{x}_i\)
  • Consider a linear regression model without an intercept, i.e., forcing \(\beta_0=0\)

If \(n=p\) and \(\mathbf{X}=\mathbf{I}_p\), then we have a very special model: \[y_j = \beta_j+\varepsilon_j,j=1,\ldots,p\]

LSE and LASSO estimate

For the special model \[y_j = \beta_j+\varepsilon_j,j=1,\ldots,p,\]

  • the LSE (i.e., least squares estimate) is \[\hat{\beta}_j=y_j\]
  • the \(l_1\)-regularized RSS becomes \[ L_1(\beta_{0},\boldsymbol{\beta},\lambda)= \sum_{j=1}^p (y_i - \beta_j)^2 + \lambda \sum_{i=1}^p \vert\beta_i\vert \]

LSE and LASSO estimate

The LASSO estimate, more complicated than LSE and ridge estimate, is:

\[\hat{\beta}_{j,\lambda}^L = \left\{ \begin{array} {lll} 0 & \text{if} & \vert y_j \vert \le \lambda/2 \\ y_j -\lambda/2 & \text{if} & y_j > \lambda/2 \\ y_j +\lambda/2 & \text{if} & y_j < -\lambda/2 \end{array}\right. \] Note: compare the above with LSE \(\hat{\beta}_j=y_j\) and ridge estimate \[\hat{\beta}_{j,\lambda}^R=y_j/(1+\lambda)\]

Three estimators

LASSO: application

Illustration

Modelling the Credit data set

  • Response Balance
  • Predictors Income, Limit, Rating and Student

Illustration

Illustration

\(p=45\) predictors and \(n=50\) observations; 43 \(\beta_j\)’s are 0

Inference

  • The method of bias correction can be used to conduct inference on LASSO estimates
  • Canonical assumptions involve Gaussian random errors
  • Asymptotic theory on testing coefficients is based on Gaussian limiting distributions
  • P-values from testing can be obtained

Note: “High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi” by Ruben Dezeure, Peter Buhlmann, Lukas Meier and Nicolai Meinshausen

Inference

  • Tesing \(H_{j0}:\beta_j =0\) versus \(H_{j1}:\beta_j \ne 0\)
  • P-values for testing if the coefficient of each of Income, Limit, Rating and Student:
       Income         Limit        Rating    StudentYes 
2.735628e-255  5.462397e-83 2.106189e-108 5.110140e-130 

Ridge and LASSO estimates

Three optimizations

Three estimators

Ridge and LASSO estimates

\(p=45\) predictors and \(n=50\) observations; all \(\beta_j \ne 0\)

Ridge and LASSO estimates

\(p=45\) predictors and \(n=50\) observations; 43 \(\beta_j\)’s are 0

Ridge and LASSO regressions

  • Ridge regression performs better when the response is a function of many predictors, all with coefficients of roughly equal size
  • LASSO performs better when a relatively small number of predictors have substantial coefficients and the remaining predictors have very small or zero coefficients
  • Both ridge and LASSO regression yield a reduction in variance at the expense of a small increase in biases, when the LSE have excessively high variance
  • LASSO performs variable selection while ridge regression does not

License and session Information

License

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] knitr_1.21

loaded via a namespace (and not attached):
 [1] compiler_3.5.0  magrittr_1.5    tools_3.5.0    
 [4] htmltools_0.3.6 revealjs_0.9    yaml_2.2.0     
 [7] Rcpp_1.0.3      stringi_1.2.4   rmarkdown_1.11 
[10] stringr_1.3.1   xfun_0.4        digest_0.6.18  
[13] evaluate_0.12