Stat 435 Lecture Notes 2

Xiongzhi Chen

Washington State University

Simple linear regression

Overview

Simple linear regression

  • models the expectation of a random variable as a linear function of another random variable
  • is perhaps the simplest (useful) model

For simple linear regression, we will cover its methodology, diagnostics and application.

Motivation: Example 1

  • How is sales (in thousands of units) for a particular product related to advertising budgets (in thousands of dollars) for TV, radio or newspaper?

  • How accurately can sales amount be predicted by advertising budgets for TV, radio or newspaper?

  • Is splitting advertising budgets for radio and TV better than allocating it only to TV (or radio) in terms of increasing sales amount?

Motivation: Example 1

Motivation: Example 2

  • How is Balance of credit card users related to Gender or marital status (Married or not)?

  • How is Balance of credit card users related to Age or Income?

  • Do users of different Gender in a given Income range have different Balance?

Motivation: Example 2

Motivation: Example 2

How is Balance of a credit card user related to Age or Income?

Intuitive Model

For the two motivating examples, simple linear models can be proposed as follows:

  • Sales \(\approx \beta_0\) + \(\beta_1 \times\) TV for some constants \(\beta_0\) and \(\beta_1\)
  • Balance \(\approx \beta_0\) + \(\beta_1 \times\) Income for some constants \(\beta_0\) and \(\beta_1\)
  • Balance \(\approx \beta_{0,0}\) + \(\beta_{1,0} \times\) Income for some constants \(\beta_{0,0}\) and \(\beta_{1,0}\) if Gender=Female
  • Balance \(\approx \beta_{0,1}\) + \(\beta_{1,1} \times\) Income for some constants \(\beta_{0,1}\) and \(\beta_{1,1}\) if Gender=Male

Model and interpretation

The model \[ \textsf{Sales} \approx \beta_0 + \beta_1 \times \textsf{TV}\] can be interpreated as \[E(Y) = \beta_0 + \beta_1 X\] with \(Y\)=Sales and \(X\)=TV.

Note: \(E(Y)\) is modelled as a linear function of \(X\).

Model and intepretation

The model \[\textsf{Balance} \approx \beta_{0,0} + \beta_{1,0} \times \textsf{Income} \quad \text{if} \quad \textsf{Gender=Female}\] can be interpreated as \[E(Y) = \beta_{0,0} + \beta_{1,0} X \quad \text{if} \quad X_1=0\] with \(Y\)=Balance and \(X\)=Income if \(X_1\)=Gender and \(X_1=0\), where Female is coded as \(0\).

Simple linear regression with a quantitative predictor

Summary

Simple linear regression with a quantitative predictor is used to model the mean of a quantitative random variable (called response variable) as a linear function of another quantitative random variable (called dependent variable).

True model

  • For two quantitative random varibles \(Y\) and \(X\), a simple linear model is \[E(Y) = \beta_0 + \beta_1 X,\] where \(\beta_0\) (intercept) and \(\beta_1\) (slope) are unknown, true model parameters (or coefficients), and \(\beta_1\) is called the regression coefficient.

  • The above model is equivalent to \[Y = \beta_0 + \beta_1 X + \varepsilon \quad \text{ with } \quad E(\varepsilon)=0,\] which is called the population regression line.

Discussion on model

  • Why a model \[Y = \beta_0^{\prime} + \beta_1^{\prime} X + \varepsilon^{\prime} \quad \text{ with } \quad E(\varepsilon^{\prime}) \ne 0\] can always be written into \[Y = \beta_0 + \beta_1 X + \varepsilon \quad \text{ with } \quad E(\varepsilon) = 0\] when \(\varepsilon\) is independent of \(X\)?

  • What does \(\varepsilon\) represent?

Intepretation

The model \[E(Y) = \beta_0 + \beta_1 X\] and its equivalent \[Y = \beta_0 + \beta_1 X + \varepsilon \quad \text{ with } \quad E(\varepsilon)=0\] postulate the following:

  • The values of \(Y\) oscillate around the line \[l(X)=\beta_0 + \beta_1 X\] with “cancelling magnitudes”.
  • For each unit change in \(X\), \(E(Y)\) changes \(\beta_1\) units, where the units for \(X\) and \(Y\) can be different.
  • When \(X=0\), \(E(Y)=\beta_0\).

Estimated model

With observations \((x_i,y_i),i=1,\ldots,n\) for \((X,Y)\),

  • the model postulates \[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,\] where \(\varepsilon_i\) is an unobservable realization of \(\varepsilon\);

  • an estimated model is \[\hat{y}=\hat{\beta}_0 + \hat{\beta_1} x,\] where \(\left(\hat{\beta}_0,\hat{\beta}_1\right)\) is an estimate of \(({\beta}_0,{\beta}_1)\), and \(\left(\hat{\beta}_0,\hat{\beta}_1\right)\) is a function of the data \((x_i,y_i)\) for \(i=1,\ldots,n\).

Residuals

Based on an estimated model \(\hat{y} = \hat{\beta}_0 + \hat{\beta_1} x\),

  • each \(y_i\) is estimated as \[\hat{y}_i = \hat{\beta}_0 + \hat{\beta_1} x_i\]
  • each \(y_i\) has residual \(e_i = y_i- \hat{y}_i\)
  • each \(e_i\) is an estimate of the unobservable \(\varepsilon_i\)

Note: Almost all information on \(\varepsilon\) is contained in the \(e_i\)’s.

Observations

  • Let \(X\) = TV and \(Y\)=sales
  • \(n=200\) observations \((x_i,y_i),i=1,\ldots,n\) for \((X,Y)\)

Check correlation:

> cor(adData$sales,adData$TV)
[1] 0.7822244

Target: Estimate the model \(E(Y) = \beta_0 + \beta_1 X\).

Scatterplot

Scatterplot of \(X\)=TV and \(Y\)=sales:

An estimated model

An estimated model \(\hat{y} = 5 + 0.05 x\):

Methods of estimation

  • Since an estimate \(\left(\hat{\beta}_0,\hat{\beta}_1\right)\) is a function of the observations \((x_i,y_i)\) for \(i=1,\ldots,n\), there are infinitely many lines \[\hat{l}(X)=\hat{\beta}_0 + \hat{\beta}_1X\] with \(\hat{\beta}_0,\hat{\beta}_1 \in \mathbb{R}\) that can be used as an estimate of \(E(Y) = \beta_0 + \beta_1 X\).

  • Which line to should we use?

An optimal estimate

Estimate based on least squares \(\hat{y} = 7.03+0.047x\):

Comparison

An estimate and an optimal estimate:

Least squares (LS) method

Each fitted (or estimated) model \(\hat{y}=\hat{\beta}_0 + \hat{\beta}_1 x\) produces

  • estimated \(y_i\) as \(\hat{y}_i=\hat{\beta}_0 + \hat{\beta}_1 x_i\)
  • residual \(e_i = y_i- \hat{y}_i\)

So, we seek for \(\left(\hat{\beta}_0,\hat{\beta}_1\right)\) for which the corresponding residual sum of squares (RSS) \[\textsf{RSS}=\sum_{i=1}^n e_i^2=\sum_{i=1}^n \left[y_i - \left(\hat{\beta}_0 + \hat{\beta}_1 x_i\right)\right]^2\] is minimized. This is called the least squares method, and \(\left(\hat{\beta}_0,\hat{\beta}_1\right)\) the least squares estimate of \(\left({\beta}_0,{\beta}_1\right)\).

The least squares estimate

The LS method gives the least squares estimate (LSE):

  • \(\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\) with \(\bar{y}=n^{-1}\sum_{i=1}^n y_i\)

  • \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\) with \(\bar{x}=n^{-1}\sum_{i=1}^n x_i\)

Namely, the fitted model is \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,\] also called the least squares line.

Note: “fitted model” is the same as “estimated model”.

RSS surface

License and session Information

License

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] knitr_1.21

loaded via a namespace (and not attached):
 [1] compiler_3.5.0  magrittr_1.5    tools_3.5.0    
 [4] htmltools_0.3.6 revealjs_0.9    yaml_2.2.0     
 [7] Rcpp_1.0.0      stringi_1.2.4   rmarkdown_1.11 
[10] stringr_1.3.1   xfun_0.4        digest_0.6.18  
[13] evaluate_0.12