
In practice, we do not have full information on a data-generating process. Hence, we do not fully know the random variable that is used to model the process. Instead, we only have observations from the process.
To statistically learn the process, various sample statistics are obtained from these observations (in contrast to population statistics). However, these observations inherit uncertainty from the process and are random variables too. So, a statistic has its own probabilistic behavior.
Let \(X\) be a random variable with expectation \(\mu=E(X)\) and variance \(\sigma^2=Var(X)\), and suppose \(x_1,x_2,\ldots,x_n\) are \(n\) observations from \(X\).
The sample mean of \(X\) is defined as \(\hat{\mu}_n = n^{-1}\sum_{i=1}^n x_i\).
The sample variance of \(X\) is defined as \[\hat{\sigma}_n^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \hat{\mu}_n)^2.\]
Note: sample mean and sample variance estimate population mean and population variance, respectively.
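The two definitions above can be computed directly. The following is a minimal sketch in Python using only the standard library; the function names and the observations are hypothetical and chosen purely for illustration.

```python
def sample_mean(xs):
    """Sample mean: n^{-1} times the sum of the observations."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance with the (n - 1) divisor, as in the definition above."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)


# Hypothetical observations, for illustration only.
observations = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
mu_hat = sample_mean(observations)
sigma2_hat = sample_variance(observations)
print(mu_hat, sigma2_hat)
```

Python's `statistics.mean` and `statistics.variance` use the same definitions, so they can serve as a cross-check.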
Suppose the \(n\) observations \(x_1,x_2,\ldots,x_n\) are mutually independent.
What is the expectation of the sample mean \(\hat{\mu}_n\)?
What is the variance of \(\hat{\mu}_n\)? Is \(\hat{\mu}_n\) less variable than \(X\)?
What is the expectation of the sample variance \(\hat{\sigma}_n^2\)? Is \(\hat{\sigma}_n^2\) less variable than \(X\)?
Note: If \(x_i,i=1,\ldots,n\) are independent and follow the same distribution, they are called “i.i.d.” and also called a random sample (of size \(n\)) from \(X\).
Variance of sample mean \(\hat{\mu}_n\) under i.i.d. assumption:
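A short sketch of the derivation, assuming the \(x_i\) are i.i.d. with finite variance \(\sigma^2\). It uses only linearity of expectation and the fact that the variance of a sum of independent random variables is the sum of their variances:

\[\begin{aligned}
E(\hat{\mu}_n) &= n^{-1}\sum_{i=1}^n E(x_i) = \mu, \\
Var(\hat{\mu}_n) &= n^{-2}\sum_{i=1}^n Var(x_i) = \frac{\sigma^2}{n}.
\end{aligned}\]

So, for \(n>1\) and \(\sigma^2>0\), the sample mean \(\hat{\mu}_n\) is less variable than \(X\), and its standard deviation shrinks at the rate \(1/\sqrt{n}\).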
Let \(x_i,i=1,\ldots,n\) be \(n\) observations from \(X\), and \(y_i,i=1,\ldots,n\) those from \(Y\).
The sample covariance between \(X\) and \(Y\) is defined as \[\widehat{Cov}(X,Y)= (n-1)^{-1} \sum_{i=1}^n (x_i-\hat{\mu}_{n,X})(y_i-\hat{\mu}_{n,Y}),\] where \(\hat{\mu}_{n,X}\) is the sample mean for \(X\), and \(\hat{\mu}_{n,Y}\) the sample mean for \(Y\).
Note: when \(X=Y\), sample covariance becomes sample variance.
The sample correlation between \(X\) and \(Y\) is defined as \[\widehat{Cor}(X,Y)= \frac{\widehat{Cov}(X,Y)}{\hat{\sigma}_{n,X}\hat{\sigma}_{n,Y}},\] where \(\hat{\sigma}_{n,X}\) is the sample standard deviation of \(X\), and \(\hat{\sigma}_{n,Y}\) that for \(Y\).
Note: sample covariance and sample correlation estimate population covariance and population correlation, respectively.
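The definitions of sample covariance and sample correlation translate almost line by line into code. This is a minimal Python sketch with hypothetical function names and made-up paired observations; it also illustrates the earlier note that the sample covariance of a sample with itself is its sample variance.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance with the (n - 1) divisor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal size n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_cor(xs, ys):
    """Sample correlation: covariance divided by both sample standard deviations."""
    sd_x = math.sqrt(sample_cov(xs, xs))  # sample_cov(xs, xs) is the sample variance
    sd_y = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (sd_x * sd_y)


# Hypothetical paired observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
cov_xy = sample_cov(xs, ys)
cor_xy = sample_cor(xs, ys)
print(cov_xy, cor_xy)
```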

Four sets of data with the same correlation of 0.816. (Image credit: wikipedia)

Samples with a given population correlation. (Image credit: wikipedia)
Suppose \(X\) and \(Y\) are independent. Given a random sample of size \(n\) from \(X\) and \(Y\), respectively, let \(r_n\) be the sample correlation computed from the two random samples.
Should \(r_n\) be \(0\)? Why or why not?
Is it more likely for \(r_n\) to be \(0\) or not \(0\)? Why or why not?
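One way to explore these two questions empirically is a small simulation. The sketch below assumes, purely for illustration, that \(X\) and \(Y\) are independent standard Gaussian random variables. The sample size, the number of repetitions, and the seed are arbitrary choices.

```python
import math
import random


def sample_cor(xs, ys):
    """Sample correlation (the n - 1 divisors cancel in the ratio)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)  # fixed seed so the run is reproducible
n = 50                     # size of each random sample
reps = 2000                # number of simulated pairs of samples

cors = []
for _ in range(reps):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]  # drawn independently of xs
    cors.append(sample_cor(xs, ys))

num_exact_zero = sum(1 for r in cors if r == 0.0)
mean_r = sum(cors) / reps
max_abs_r = max(abs(r) for r in cors)
print(num_exact_zero, mean_r, max_abs_r)
```

In such a simulation, the values of \(r_n\) scatter around \(0\) without being exactly \(0\): \(r_n\) is itself a random variable, even though the population correlation is \(0\).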
Let \(X\) be a random variable. Suppose we are interested in a statistic \(\theta\) of \(X\), such as its mean, median, or variance. Then
Often a test statistic \(T\) is constructed, so that its distribution is known if the null hypothesis is true, and then a decision rule is obtained to either reject or retain \(H_0\) under some error criterion.
For example, if we want to assess whether a pesticide is able to kill grasshoppers, we can let \(\mu_1\) and \(\mu_2\) be the mean numbers of grasshoppers before and after pesticide usage, respectively, then
A test statistic for this can be set as \(d\), the difference between the sample mean numbers of grasshoppers before and after pesticide usage. Once we know the distribution of \(d\) under \(H_0\), a decision rule can be obtained to reject or retain \(H_0\).
Often a test is conducted under a constraint on its Type I error and attempts to be the most powerful (among all tests with certain properties).
Let \(X\) be a Gaussian random variable with mean \(\mu\) and standard deviation \(\sigma>0\). Suppose we want to test \(H_0: \mu=0\) versus \(H_a: \mu \ne 0\), given a random sample \(x_1,x_2,\ldots,x_n\) of size \(n\).
Then we can compute the sample mean \(\hat{\mu}_n = n^{-1}\sum_{i=1}^n x_i\) and sample variance \[\hat{\sigma}_n^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \hat{\mu}_n)^2\]
Under \(H_0\), the test statistic \(T = \frac{\hat{\mu}_n}{\hat{\sigma}_n/\sqrt{n}}\) follows a Student t distribution, denoted by \(F_{0,n-1}\), with noncentrality parameter \(0\) and \(n-1\) degrees of freedom.
At Type I error level \(\alpha\), we reject \(H_0: \mu=0\) if \(\vert T \vert > t_{0,n-1}(1-\alpha/2)\), where \(t_{0,n-1}(1-\alpha/2)\) is the \((1-\alpha/2) \times 100\)-th percentile of the distribution \(F_{0,n-1}\).
The two-sided p-value for the test is \(2 \times F_{0,n-1}(-\vert T \vert)\).
Note: this example is a two-sided test.
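The test above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the ten observations are made up, \(\alpha=0.05\), and the critical value \(2.262\) is the tabulated 97.5th percentile of the Student t distribution with \(9\) degrees of freedom. Since the Python standard library has no Student t distribution function, the two-sided p-value is approximated by Monte Carlo: under \(H_0\) the distribution of \(T\) does not depend on \(\sigma\), so null samples can be drawn with \(\sigma=1\).

```python
import math
import random


def t_statistic(xs):
    """T = sample mean / (sample standard deviation / sqrt(n))."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean / math.sqrt(var / n)


# Hypothetical random sample of size n = 10, for illustration only.
sample = [0.5, 1.2, -0.3, 0.8, 1.5, 0.9, 0.1, 1.1, 0.7, 0.4]
t_obs = t_statistic(sample)

# Tabulated 97.5th percentile of Student t with 9 degrees of freedom (alpha = 0.05).
T_CRIT_DF9 = 2.262
reject_h0 = abs(t_obs) > T_CRIT_DF9

# Monte Carlo approximation of the two-sided p-value under H0: mu = 0.
rng = random.Random(1)
n = len(sample)
reps = 20000
exceed = 0
for _ in range(reps):
    null_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if abs(t_statistic(null_sample)) >= abs(t_obs):
        exceed += 1
p_value_mc = exceed / reps

print(t_obs, reject_h0, p_value_mc)
```

With a statistics package available, the exact p-value \(2 \times F_{0,n-1}(-\vert T \vert)\) would normally be computed from the t distribution function instead of by simulation.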
For \(s \in (0,1)\), a \((1-s) \times 100\%\) confidence interval (CI) is a random interval, computed from the observations, that contains an unknown population parameter with probability \(1-s\).
Often, the distribution of a test statistic under the null hypothesis gives sufficient information for constructing a CI.
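As an illustration, consider the Gaussian setting of the t-test example above, with the same notation. For any value of \(\mu\), the quantity \((\hat{\mu}_n-\mu)/(\hat{\sigma}_n/\sqrt{n})\) follows the distribution \(F_{0,n-1}\), so

\[\Pr\left(\left\vert \frac{\hat{\mu}_n-\mu}{\hat{\sigma}_n/\sqrt{n}} \right\vert \le t_{0,n-1}(1-s/2)\right) = 1-s.\]

Solving the inequality for \(\mu\) gives a \((1-s) \times 100\%\) CI for \(\mu\):

\[\left[\hat{\mu}_n - t_{0,n-1}(1-s/2)\frac{\hat{\sigma}_n}{\sqrt{n}},\; \hat{\mu}_n + t_{0,n-1}(1-s/2)\frac{\hat{\sigma}_n}{\sqrt{n}}\right].\]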
The strong law of large numbers (SLLN) describes the behavior of a sequence of random variables that are indexed by sample size, such that as the sample size tends to infinity, it is almost certain that the limiting random variable assumes a single value, i.e., its expectation.
The SLLN is the strongest large-sample characterization of the probabilistic behavior of a sequence of random variables. However, it cannot be used for hypothesis testing on an estimate since we only have finitely many observations.
Let \(X\) be a random variable with expectation \(\mu=E(X)\) and variance \(\sigma^2=Var(X)\), and suppose \(x_1,x_2,\ldots,x_n\) is a random sample of size \(n\) from \(X\).
The sequence of sample variances \[\hat{\sigma}_n^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \hat{\mu}_n)^2,\] indexed by \(n\), always has expectation \(\sigma^2\), and satisfies the SLLN, i.e., \[\Pr\left(\lim_{n \to \infty} \hat{\sigma}_n^2 = \sigma^2\right)=1.\]
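The SLLN for the sample variance can be illustrated by following a single sequence of observations as \(n\) grows. The sketch below assumes, for illustration only, a Gaussian \(X\) with \(\mu=1\) and \(\sigma=2\), so that \(\sigma^2=4\). The checkpoints and the seed are arbitrary. The running sample variance is updated with Welford's online algorithm, a standard way to avoid recomputing the sum at every \(n\).

```python
import random

rng = random.Random(7)  # fixed seed so the run is reproducible
mu, sigma = 1.0, 2.0    # assumed population mean and standard deviation
checkpoints = [10, 100, 1000, 10000, 50000]

# Welford's online algorithm: running mean and running sum of squared deviations.
count = 0
running_mean = 0.0
sum_sq_dev = 0.0
variance_path = {}

for _ in range(max(checkpoints)):
    x = rng.gauss(mu, sigma)
    count += 1
    delta = x - running_mean
    running_mean += delta / count
    sum_sq_dev += delta * (x - running_mean)
    if count in checkpoints:
        # Sample variance of the first `count` observations, (n - 1) divisor.
        variance_path[count] = sum_sq_dev / (count - 1)

for n_obs, var_hat in variance_path.items():
    print(n_obs, var_hat)
```

Along the path, the printed sample variances should settle near \(\sigma^2=4\) as \(n\) increases, which is the behavior the SLLN describes.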
The central limit theorem (CLT) describes the limiting behavior of a sequence of standardized estimates that are obtained from a set of weakly interacting random variables. The CLT and the Glivenko-Cantelli theorem form the cornerstones of statistical learning.
In particular, the CLT enables us to learn a data-generating process whenever we have a random sample of relatively large size. Unlike the SLLN, the CLT can be used to conduct hypothesis testing and construct confidence intervals.
Let \(X\) be a random variable with expectation \(\mu=E(X)\) and variance \(\sigma^2=Var(X)\), and suppose \(x_1,x_2,\ldots,x_n\) is a random sample of size \(n\) from \(X\).
The standardized sequence of sample means \[\tilde{\mu}_n = \frac{\hat{\mu}_n - \mu}{\sigma/\sqrt{n}}\] satisfies the CLT.
Namely, as \(n\) becomes larger, the distribution of \(\tilde{\mu}_n\) becomes closer to the distribution of \(Z\), the standard Gaussian random variable.
In mathematical notation, the above is \[\lim_{n \to \infty} \Pr(\tilde{\mu}_n \le x) = \Pr(Z \le x) \text{ for any } x.\]
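The CLT statement can also be checked numerically. The sketch below assumes, for illustration only, that \(X\) is exponential with rate \(1\) (so \(\mu=1\) and \(\sigma=1\)), a clearly non-Gaussian choice. It simulates many standardized sample means \(\tilde{\mu}_n\) and compares their empirical distribution function with that of \(Z\) at a few points. The values of \(n\), the number of repetitions, and the seed are arbitrary.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(11)  # fixed seed so the run is reproducible
n = 200                  # sample size behind each sample mean
reps = 4000              # number of simulated sample means
mu, sigma = 1.0, 1.0     # mean and standard deviation of Exponential(rate = 1)

standardized = []
for _ in range(reps):
    xs = [rng.expovariate(1.0) for _ in range(n)]
    mean = sum(xs) / n
    standardized.append((mean - mu) / (sigma / math.sqrt(n)))

# Compare the empirical CDF of the standardized means with the standard Gaussian CDF.
std_normal = NormalDist()
cdf_gaps = {}
for x in (-1.0, 0.0, 1.0):
    empirical = sum(1 for z in standardized if z <= x) / reps
    cdf_gaps[x] = abs(empirical - std_normal.cdf(x))

print(cdf_gaps)
```

The gaps should be small, and they would be expected to shrink further as \(n\) and the number of repetitions grow.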
An animation for CLT: https://yihui.org/animation/example/clt-ani/
The SLLN and CLT are frequently used (sometimes implicitly) in the practice of statistical learning and statistical thinking.
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.21
loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5 tools_3.5.0
[4] htmltools_0.3.6 revealjs_0.9 yaml_2.2.0
[7] Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
[10] stringr_1.3.1 xfun_0.4 digest_0.6.18
[13] evaluate_0.12