Stat 435 Lecture Notes 5

Xiongzhi Chen

Washington State University

Logistic regression: model

Motivation

Logistic regression is used to model the probability of the status of a binary variable, such as

  • Failure or success of an experiment
  • Deficit or surplus of an asset return
  • Win or lose of a game

Coding a binary variable

Suppose we are trying to predict the medical condition \(Y\) of a patient in the emergency room on the basis of her symptoms

  • Two diagnoses for \(Y\): “stroke”and “drug overdose”
  • Code \(Y\) into \[ Y=\left\{ \begin{array} [c]{ccc} 0 & \text{if} & \text{stroke}\\ 1 & \text{if} & \text{drug overdose} \end{array} \right. \]

Note: the roles of 1 and 0 for \(Y\) can be switched, and that different coding for \(Y\) can be used

Modeling a binary variable

  • Suppose we have two predictors \(X_1\) and \(X_2\), a naive way to model \(Y\) is by a linear mode \[\begin{equation} Y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\varepsilon, \end{equation}\] so that drug overdose is predicted if \(\hat{Y}>0.5\) and stroke otherwise.
  • Issue with the above modelling strategy: a predicted value \(\hat{Y}\) can lie outside the set of values used for coding

The logistic model

Consider the relationship between if a customer will default on his credit card debt based on his balance

  • Set \(Y\) (Yes as 1 or No as 0) as the default status of a customer
  • Model the probability \[ p=\Pr\left(Y=1 \vert\text{balance}\right) =p\left( \text{balance}\right) \] as a function of balance
  • We may predict default=Yes for an individual for whom \(p\left( \text{balance}\right) >0.5\).

The logistic model

Let \(X\) denote balance. We

  • should not consider the model \(p\left( X\right) =\beta_{0}+\beta_{1}X\)
  • should model the logit of \(p\left(X\right)\), i.e., \[\begin{equation} \operatorname{logit}\left( p\left( X\right) \right) =\log\left( \frac{p\left( X\right) }{1-p\left( X\right) }\right) \end{equation}\] via \[ \operatorname{logit}\left( p\left( X\right) \right) =\beta_{0}+\beta_{1}X \]

Note: \(p\left( X\right) /\left( 1-p\left( X\right) \right)\) is called odds; no random error term in the above model since we are modeling a probability; \(p\left(X\right) =E\left( Y=1|X\right)\)

The logsitic model

  • Equivalence: \(\operatorname{logit}\left( p\left( X\right) \right) =\beta_{0}+\beta_{1}X\) is equivalent to \[\begin{equation} p\left( X\right) =\frac{\exp\left( \beta_{0}+\beta_{1}X\right) } {1+\exp(\beta_{0}+\beta_{1}X)} \end{equation}\]
  • The logit function \[ f\left( t\right) =\frac{e^{t}}{1+e^{t}} \] is a sigmoid function that has an asymptote at \(1\) as \(t\rightarrow\infty\) and at \(0\) as \(t\rightarrow-\infty\).

The logistic model

Logistic regression: estimation and inference

Estimated coefficients

  • For the linear model \[ Z=\beta_{0}+\beta_{1}X+\varepsilon, E\left( \varepsilon\right) =0 \] the coefficient \(\beta_{1}\) is the relative unit change in \(E\left( Z\right)\)
  • For model \[ \operatorname{logit}\left( p\left( X\right) \right)=\log\left( \frac{p\left( X\right) }{1-p\left( X\right) }\right) =\beta_{0}+\beta_{1}X \] the coefficient \(\beta_{1}\) is the relative unit change in the log-odds, i.e., \[ \log\left( \frac{p\left( x_{1}\right) }{1-p\left( x_{1}\right) }\right) -\log\left( \frac{p\left( x_{2}\right) }{1-p\left( x_{2}\right) }\right) =\beta_{1}\text{ when }x_{1}-x_{2}=1 \]

Estimating coefficients

Consider the logistic regression model

\[\begin{equation} \log\left( \frac{p\left( x_{i}\right) }{1-p\left( x_{i}\right) }\right)=\beta_{0}+\beta_{1}x_{i} \end{equation}\]
  • the \(i\)th response \(y_{i}\) given \(x_{i}\) is a Bernoulli random variable with probability of success \(p\left( x_{i}\right) =\Pr\left( y_{i}=1|x_{i}\right)\) and \[\begin{equation} p\left( x_{i}\right) =\frac{\exp\left( \beta_{0}+\beta_{1}x_{i}\right) }{1+\exp\left( \beta_{0}+\beta_{1}x_{i}\right) } \end{equation}\]
  • the probability mass function (PMF) of \(y_{i}\) given \(x_{i}\) is \[ f\left( y_{i}\right) =p_{i}^{y_{i}}\left( 1-p_{i}\right)^{1-y_{i}} \]

Estimating coefficients

The maximum likelihood estimation (MLE) method:

  • When \(\left\{ y_{i}\right\} _{i=1}^{n}\) independent given \(\left\{ x_{i}\right\}_{i=1}^{n}\), the joint PMF of \(\left\{ y_{i}\right\}_{i=1}^{n}\), i.e., the likelihood function of \(\left(\beta_{0},\beta_{1}\right)\), is \[\begin{align*} L\left( \beta_{0},\beta_{1}\right) & =\prod\nolimits_{i=1}^{n}f\left( y_{i}\right) =\prod\nolimits_{i=1}^{n}p_{i}^{y_{i}}\left( 1-p_{i}\right) ^{1-y_{i}} \end{align*}\]
  • If we maximize \(L\left(\beta_{0},\beta_{1}\right)\) with respect to \(\left(\beta_{0},\beta_{1}\right)\), then we can equivalently maximize \[\begin{equation} \log L\left( \beta_{0},\beta_{1}\right)=\sum_{i=1}^{n}\left[ y_{i}\log p_{i}+\left( 1-y_{i}\right) \log\left( 1-p_{i}\right) \right] \end{equation}\]

Note: not minimize \(RSS=\sum_{i=1}^{n}\left[ y_{i}-\left( \beta_{0}+\beta_{1}x_{i}\right)\right]^{2}\)

Estimating coefficients

  • Optimization of \(L\left(\beta_{0},\beta_{1}\right)\) is usually implemented by gradient descent using the gradient of \(\log L\left(\beta_{0},\beta_{1}\right)\) or \(L\left(\beta_{0},\beta_{1}\right)\)

  • The algorithm is referred to as the “iteratively reweighted least squares (IWLS)”

Inference on coefficients

Inference on the estimate \(\left(\hat{\beta}_{0},\hat{\beta}_{1}\right)\) is often based on asymptotic theory. Under appropriate conditions, we have the following:

  • Let \(s_{1}\) be the standard deviation of \(\hat{\beta}_{1}\). Then \[\begin{equation} \frac{\hat{\beta}_{1}}{s_{1}}\rightarrow\text{Gaussian}\left( 0,1\right) \text{ }\ \text{as }n\rightarrow\infty\text{ if }\beta_{1}=0 \end{equation}\]
  • Let \(s_{0}\) be the standard deviation of \(\hat{\beta}_{0}\). Then \[\begin{equation} \frac{\hat{\beta}_{0}}{s_{0}}\rightarrow\text{Gaussian}\left( 0,1\right) \text{ }\ \text{as }n\rightarrow\infty\text{ if }\beta_{0}=0 \end{equation}\]

Estimation and inference

Estimation and inference

Logistic regression: predicition

Making predictions

  • Recall \[ p\left( x_{i}\right) =\frac{\exp\left( \beta_{0}+\beta_{1}x_{i}\right) }{1+\exp\left( \beta_{0}+\beta_{1}x_{i}\right) } \]

  • Once the estimate \(\left( \hat{\beta}_{0},\hat{\beta}_{1}\right)\) is obtained, the predicted value of \(p\left( x_{i}\right)\) is given by \[ \hat{p}\left( x_{i}\right) =\frac{\exp\left( \hat{\beta}_{0}+\hat{\beta }_{1}x_{i}\right) }{1+\exp\left( \hat{\beta}_{0}+\hat{\beta}_{1}% x_{i}\right) }% \]

Making predictions

  • When balance \(X=1000\), we have \[ \hat{p}\left(Y=1 \vert X=1000\right) =\frac{\exp\left( -10.65+0.0055\times1000\right) }{1+\exp\left( -10.65+0.0055\times1000\right) }=0.006 \]

  • When \(X\) denotes if an individual is a student (\(X=1\)) or not (\(X=0\)), we have \[ \hat{p}\left(Y=1 \vert X\text{=Yes}\right) =\frac{\exp\left( -3.504+0.4049\times1\right) }{1+\exp\left( -3.504+0.4049\times1\right) }=0.043 \] and \[ \hat{p}\left(Y=1\vert X\text{=No}\right) =\frac{\exp\left( -3.504+0.4049\times0\right) } {1+\exp\left( -3.504+0.4049\times0\right) }=0.029 \]

The multiple linear logistic regression model

The model

When there are \(p\) predictors \(X_j,j=1,\ldots,p\), whose \(i\)th observation is \(x_{ij}\),

  • the multiple linear logistic regression model takes the form \[ \log\left( \frac{p\left( \mathbf{x}_{i}\right) }{1-p\left( \mathbf{x}_{i}\right) }\right) =\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots+\beta_{p}x_{ip} \] for \(i=1,2,\cdots,n\) with \(\mathbf{x}_{i}=(x_{i1},\ldots,x_{ip})\)
  • the model is equivalent to \[ p\left( \mathbf{x}_{i}\right) =\frac{\exp\left( \beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots+\beta_{p}x_{ip}\right) } {1+\exp\left( \beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots+\beta_{p}x_{ip}\right) } \]

Estimation and inference

  • The maximum likelihood method is also used to estimate the coefficient vector \(\beta=\left( \beta_{0},\beta_{1},\ldots,\beta_{p}\right)\)

  • Gaussian asymptotic theory for \(\hat{\beta}=\left( \hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{p}\right)\) is used to conduct inference on components of \(\beta\)

  • For example,for each \(i\in\left\{ 0,1,\ldots,p\right\}\), let \(s_{i}\) be the standard deviation of \(\hat{\beta}_{i}\), then (under appropriate conditions) as the sample size \(n\rightarrow\infty\), \[\begin{equation} \frac{\hat{\beta}_{i}}{s_{i}}\rightarrow\text{Gaussian}\left( 0,1\right) \text{ if }\beta_{i}=0 \end{equation}\]

Making predictions

Similar to simple logistic regression

Estimation and inference

Estimation and inference

Penalized logistic regression: model

Motivation

Consider using genetics to predict a trait. Suppose that there are \(n\) individuals, each with \(m\) measured SNP genotypes.

  • The genotype for SNP \(j\) in individual \(i\) is denoted by \(x_{ij}\in\left\{ 0,1,2\right\}\), \(i=1,2,...,n\) and \(j=1,2,...,m\). These SNP genotypes are stored in an \(n\times m\) matrix \(\mathbf{X}\) whose \(\left( i,j\right)\) entry is \(x_{ij}\)
  • Denote the genotypes for individual \(i\) by \(\mathbf{x}_i = \left(x_{i1},x_{i2},...,x_{im}\right)\)
  • Assume a trait (either quantitative or binary) has been measured on each individual, which is denoted by \(y_{i},i=1,...,n\)

Motivation

  • For \(y_i \in \{0,1\}\), consider the following logistic model \[ \log\left( \frac{p_{i}}{1-p_{i}}\right) =\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots+\beta_{m}x_{im} \] for \(i=1,...,n\), where \[ p_{i}=\Pr\left( y_{i}=1|x_{i1},...,x_{im}\right) \]

  • Written in the population version, the above becomes \[ \log\left( \frac{p\left( X\right) }{1-p\left( X\right) }\right) =\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\cdots+\beta_{m}X_{m} \] for \(j=1,...,n\), where \(X=\left( X_{1},...,X_{m}\right)\) and \[ p\left( X\right) =\Pr\left( Y=1|X_{1},...,X_{m}\right) \]

Motivation

  • Often \(m\gg n\) since \(m\) is usually in the millions but \(n\) in the thousands
  • It is widely known that only a few SNPs, i.e., \(X_{i}\)’s, that are associated with \(Y\), and that the SNPs are usually dependent
  • Assume \(\left\{y_{i}\right\}_{i=1}^{n}\) are independent given \(\mathbf{X}\), we have the likelihood function for \(\beta=\left( \beta_{0},\beta_{1},\ldots,\beta_{m}\right)\) as \[ L\left( \beta\right) =\prod\nolimits_{i=1}^{n}p_{i}^{y_{i}}\left( 1-p_{i}\right) ^{1-y_{i}} \] where \[ p_{i}=\frac{\exp\left( \beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots +\beta_{m}x_{im}\right) }{1+\exp\left( \beta_{0}+\beta_{1}x_{i1}+\beta _{2}x_{i2}+\cdots+\beta_{m}x_{im}\right) } \]

Penalized logistic regression

We can perform:

  • LASSO logistic regression when some \(\beta_{i}\)’s are zero: \[ \hat{\beta}=\left( \hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta} _{m}\right) \in\operatorname*{argmin}_{\beta}\left[ -\log L\left( \beta\right) +\lambda\sum_{j=1}^{m}\left\vert \beta_{j}\right\vert \right] \]
  • Ridge logistic regression: \[ \hat{\beta}=\left( \hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta} _{m}\right) \in\operatorname*{argmin}_{\beta}\left[ - \log L\left( \beta\right) +\lambda\sum_{j=1}^{m}\left\vert \beta_{j}\right\vert ^{2}\right] \]

Note: \(\operatorname*{argmin}_{\beta}\) refers to optimal \(\beta^{\ast}\) which minimizes the corresponding objective function

Choosing optimal \(\lambda\)

LASSO logistic regression based on training set:

[1] 0.1584858

Estimated coefficients

LASSO logistic regression and nonzero estimated coefficients based on optimal \(\lambda=0.04308576\):

[1] 0.2876821

Prediction

Predicted probabilities on test set that has 30 observations:

 [1] 0.5714286 0.5714286 0.5714286 0.5714286 0.5714286
 [6] 0.5714286 0.5714286 0.5714286 0.5714286 0.5714286
[11] 0.5714286 0.5714286 0.5714286 0.5714286 0.5714286
[16] 0.5714286 0.5714286 0.5714286 0.5714286 0.5714286
[21] 0.5714286 0.5714286 0.5714286 0.5714286 0.5714286
[26] 0.5714286 0.5714286 0.5714286 0.5714286 0.5714286

Predicted class labels at threshold 0.5 and classification table:

         
PredClass  0  1
        1 14 16

Inference

  • The method of bias correction can be used to conduct inference on penalized, shrinkage estimates
  • Canonical assumptions involve Gaussian random errors
  • Asymptotic theory on testing coefficients is based on Gaussian limiting distributions
  • P-values from testing can be obtained

Note: “High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi” by Ruben Dezeure, Peter Buhlmann, Lukas Meier and Nicolai Meinshausen

Inference: lasso logistic regression

  • Testing \(H_{j0}:\beta_j =0\) versus \(H_{j1}:\beta_j \ne 0\)
  • P-value for testing if the coefficient of a predictor is zero:
 [1] 0.13413996 0.05207436 0.39755514 0.23921034 0.65316441
 [6] 0.67885948 0.56912208 0.16846145 0.56589285 0.96587353
[11] 0.78957391 0.05460193 0.18388693 0.44317876 0.12468140
[16] 0.92400246 0.70416327 0.04088294 0.89933977 0.97099599
[21] 0.49808498 0.46499122 0.25293188 0.28736267 0.64293125
[26] 0.63676763 0.99339691 0.53668426 0.59440224 0.56993170
[31] 0.62692600 0.94943853 0.61880901 0.29636506 0.39771952
[36] 0.68962920 0.38298890 0.13892930 0.50686087 0.06816067
[41] 0.08304863 0.82547452 0.44883142 0.20457717 0.97508134
[46] 0.90046281 0.04752561 0.76775545 0.85050516 0.79927070

Inference: ridge logistic regression

  • Testing \(H_{j0}:\beta_j =0\) versus \(H_{j1}:\beta_j \ne 0\)
  • P-value for testing if the coefficient of a predictor is zero:
 [1] 0.63598833 0.06612030 0.61192496 0.94142750 0.54331824
 [6] 0.48610506 0.68454333 0.26190299 0.96993506 0.76290601
[11] 0.95534631 0.23480084 0.70366376 0.52505187 0.05438521
[16] 0.76055117 0.34191172 0.01553583 0.72333222 0.96649906
[21] 0.76586988 0.45826480 0.31636520 0.15498768 0.19719119
[26] 0.49091761 0.85448689 0.37245537 0.72834464 0.40307678
[31] 0.66596097 0.52949698 0.48666810 0.70651168 0.45695047
[36] 0.23772894 0.46093631 0.21519215 0.93828245 0.13029887
[41] 0.91628901 0.56170626 0.30576696 0.20412352 0.85898080
[46] 0.81970855 0.21620605 0.94463983 0.63989905 0.86938832

License and session Information

License

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] knitr_1.21

loaded via a namespace (and not attached):
 [1] compiler_3.5.0  magrittr_1.5    tools_3.5.0    
 [4] htmltools_0.3.6 revealjs_0.9    yaml_2.2.0     
 [7] Rcpp_1.0.0      stringi_1.2.4   rmarkdown_1.11 
[10] stringr_1.3.1   xfun_0.4        digest_0.6.18  
[13] evaluate_0.12