Statistical Inference I

Random Sampling

The collection of random variables $X_1, X_2, X_3, \dots, X_n$ is said to be a random sample of size $n$ if they are independent and identically distributed (i.i.d.), i.e.,

  1. $X_1, X_2, X_3, \dots, X_n$ are independent random variables, and
  2. they have the same distribution, i.e.,
$$F_{X_1}(x) = F_{X_2}(x) = \cdots = F_{X_n}(x), \quad \text{for all } x \in \mathbb{R}.$$

Evaluating Estimators

Let $\hat{\Theta} = h(X_1, X_2, \cdots, X_n)$ be a point estimator for $\theta$. The bias of the point estimator $\hat{\Theta}$ is defined by

$$B(\hat{\Theta}) = E[\hat{\Theta}] - \theta.$$

Let $\hat{\Theta} = h(X_1, X_2, \cdots, X_n)$ be a point estimator for a parameter $\theta$. We say that $\hat{\Theta}$ is an unbiased estimator of $\theta$ if

$$B(\hat{\Theta}) = 0, \quad \text{for all possible values of } \theta.$$

Point Estimators for Mean and Variance

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample with mean $E[X_i] = \mu < \infty$ and variance $0 < \text{Var}(X_i) = \sigma^2 < \infty$. The sample variance of this random sample is defined as

$$S^2 = \frac{1}{n-1}\sum_{k=1}^n (X_k - \overline{X})^2 = \frac{1}{n-1}\left(\sum_{k=1}^n X_k^2 - n\overline{X}^2\right),$$

where $\overline{X} = \frac{1}{n}\sum_{k=1}^n X_k$ is the sample mean. The sample variance is an unbiased estimator of $\sigma^2$.

The sample standard deviation is defined as

$$S = \sqrt{S^2},$$

and is commonly used as an estimator for $\sigma$. Nevertheless, $S$ is a biased estimator of $\sigma$.
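As a quick sanity check, here is a minimal simulation sketch (assuming NumPy is available; the distribution, sample size, and variable names are illustrative only) contrasting the unbiased $\frac{1}{n-1}$ estimator $S^2$ with the biased $\frac{1}{n}$ version by averaging both over many simulated samples.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0           # true variance of the sampled distribution
n, trials = 10, 100_000

s2_unbiased = np.empty(trials)
s2_biased = np.empty(trials)
for k in range(trials):
    x = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=n)
    s2_unbiased[k] = x.var(ddof=1)   # divides by n-1 (the sample variance S^2)
    s2_biased[k] = x.var(ddof=0)     # divides by n

print(s2_unbiased.mean())  # close to 4.0
print(s2_biased.mean())    # close to (n-1)/n * 4.0 = 3.6
```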

Maximum Likelihood Estimation

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample from a distribution with a parameter $\theta$. Suppose that we have observed $X_1=x_1, X_2=x_2, \dots, X_n=x_n$.

  1. If the $X_i$'s are discrete, then the likelihood function is defined as
$$L(x_1,x_2,\cdots,x_n;\theta) = P_{X_1 X_2 \cdots X_n}(x_1,x_2,\cdots,x_n;\theta).$$
  2. If the $X_i$'s are jointly continuous, then the likelihood function is defined as
$$L(x_1,x_2,\cdots,x_n;\theta) = f_{X_1 X_2 \cdots X_n}(x_1,x_2,\cdots,x_n;\theta).$$

The maximum likelihood estimate of $\theta$ is the value of $\theta$ that maximizes the likelihood function. In some problems, it is easier to work with the log likelihood function given by

$$\ln L(x_1,x_2,\cdots,x_n;\theta),$$

which is maximized at the same value of $\theta$.
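As an illustration (not from the text), here is a minimal sketch that finds the MLE of the rate $\lambda$ of an exponential sample by numerically maximizing the log likelihood; the closed-form answer $\hat{\lambda} = 1/\overline{x}$ is printed for comparison. The data, bounds, and names are my own assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.5, size=200)   # simulated sample, true rate lambda = 2.5

def neg_log_likelihood(lam):
    # -ln L(x_1,...,x_n; lambda) for the Exponential(lambda) model
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(res.x)           # numerical MLE of lambda
print(1 / x.mean())    # closed-form MLE, should agree
```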

Asymptotic Properties of MLEs

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample from a distribution with a parameter $\theta$. Let $\hat{\Theta}_{\text{ML}}$ denote the maximum likelihood estimator (MLE) of $\theta$. Then, under some mild regularity conditions,

  1. $\hat{\Theta}_{\text{ML}}$ is asymptotically consistent, i.e., for every $\epsilon > 0$,
$$\lim_{n\to\infty} P\left(\left|\hat{\Theta}_{\text{ML}} - \theta\right| > \epsilon\right) = 0.$$
  2. $\hat{\Theta}_{\text{ML}}$ is asymptotically unbiased, i.e.,
$$\lim_{n\to\infty} E\left[\hat{\Theta}_{\text{ML}}\right] = \theta.$$
  3. As $n$ becomes large, $\hat{\Theta}_{\text{ML}}$ is approximately a normal random variable. More precisely, the random variable
$$\frac{\hat{\Theta}_{\text{ML}} - \theta}{\sqrt{\text{Var}\left(\hat{\Theta}_{\text{ML}}\right)}}$$

converges in distribution to $N(0,1)$.
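A small simulation (my own sketch, assuming NumPy) can illustrate the third property. For a Bernoulli($p$) sample the MLE is $\hat{p} = \overline{X}$ with $\text{Var}(\hat{p}) = p(1-p)/n$, so the standardized estimator should look approximately standard normal for large $n$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, trials = 0.3, 500, 50_000

samples = rng.binomial(1, p, size=(trials, n))
p_hat = samples.mean(axis=1)                     # MLE of p for each replication
z = (p_hat - p) / np.sqrt(p * (1 - p) / n)       # standardized MLE

# Fraction of standardized values below 1.96; roughly 0.975 if approximately N(0,1)
print((z <= 1.96).mean())
```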

Interval Estimation

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample from a distribution with a parameter $\theta$ that is to be estimated. An interval estimator with confidence level $1-\alpha$ consists of two estimators $\hat{\Theta}_l(X_1,X_2,\cdots,X_n)$ and $\hat{\Theta}_h(X_1,X_2,\cdots,X_n)$ such that

$$P\left(\hat{\Theta}_l \leq \theta \text{ and } \hat{\Theta}_h \geq \theta\right) \geq 1-\alpha,$$

for every possible value of $\theta$. Equivalently, we say that $\left[\hat{\Theta}_l, \hat{\Theta}_h\right]$ is a $(1-\alpha)100\%$ confidence interval for $\theta$.

Pivotal Quantity

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample from a distribution with a parameter $\theta$ that is to be estimated. The random variable $Q$ is said to be a pivot or a pivotal quantity, if it has the following properties:

  1. It is a function of the observed data $X_1, X_2, X_3, \dots, X_n$ and the unknown parameter $\theta$, but it does not depend on any other unknown parameters:
$$Q = Q(X_1, X_2, \cdots, X_n, \theta).$$
  2. The probability distribution of $Q$ does not depend on $\theta$ or any other unknown parameters.
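A standard example: if $X_1, X_2, \dots, X_n$ is a random sample from a $N(\mu, \sigma^2)$ distribution with $\sigma$ known and $\mu$ unknown, then
$$Q = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}}$$
is a pivotal quantity for $\mu$: it is a function of the data and of $\mu$ only, and $Q \sim N(0,1)$ regardless of the value of $\mu$. This is the pivot behind the confidence interval for the mean given below.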

The Chi-Squared Distribution

Definition If $Z_1, Z_2, \cdots, Z_n$ are independent standard normal random variables, the random variable $Y$ defined as

$$Y = Z_1^2 + Z_2^2 + \cdots + Z_n^2$$

is said to have a chi-squared distribution with $n$ degrees of freedom, shown by

$$Y \sim \chi^2(n).$$

Properties:

  1. The chi-squared distribution is a special case of the gamma distribution. More specifically,
$$Y \sim \text{Gamma}\left(\frac{n}{2}, \frac{1}{2}\right).$$

Thus,

$$f_Y(y) = \frac{1}{2^{\frac{n}{2}}\,\Gamma\left(\frac{n}{2}\right)}\, y^{\frac{n}{2}-1} e^{-\frac{y}{2}}, \quad \text{for } y > 0.$$

  2. $E[Y] = n$ and $\text{Var}(Y) = 2n$.

  3. For any $p \in [0,1]$ and $n \in \mathbb{N}$, we define $\chi^2_{p,n}$ as the real value for which

$$P\left(Y > \chi^2_{p,n}\right) = p,$$

where $Y \sim \chi^2(n)$.
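In practice $\chi^2_{p,n}$ is read from a table or computed numerically; by the definition above it is the $(1-p)$-quantile of the $\chi^2(n)$ distribution. A quick sketch using SciPy (assuming `scipy.stats` is available; values are illustrative):

```python
from scipy.stats import chi2

p, n = 0.05, 10
chi2_p_n = chi2.ppf(1 - p, df=n)     # value with P(Y > chi2_p_n) = p
print(chi2_p_n)                      # about 18.307
print(1 - chi2.cdf(chi2_p_n, df=n))  # recovers p = 0.05
```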

Theorem Let $X_1, X_2, \cdots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables. Also, let $S^2$ be the sample variance for this random sample. Then, the random variable $Y$ defined as

$$Y = \frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \overline{X})^2$$

has a chi-squared distribution with $n-1$ degrees of freedom, i.e., $Y \sim \chi^2(n-1)$. Moreover, $\overline{X}$ and $S^2$ are independent random variables.

The t-Distribution

Let $Z \sim N(0,1)$ and $Y \sim \chi^2(n)$, where $n \in \mathbb{N}$. Also assume that $Z$ and $Y$ are independent. The random variable $T$ defined as

$$T = \frac{Z}{\sqrt{Y/n}}$$

is said to have a $t$-distribution with $n$ degrees of freedom, shown by

$$T \sim T(n).$$

Properties:

  1. The $t$-distribution has a bell-shaped PDF centered at $0$, but its PDF is more spread out than the normal PDF.

  2. $E[T] = 0$, for $n > 1$. But $E[T]$ is undefined for $n = 1$.

  3. $\text{Var}(T) = \frac{n}{n-2}$, for $n > 2$. But $\text{Var}(T)$ is undefined for $n = 1, 2$.

  4. As $n$ becomes large, the $t$ density approaches the standard normal PDF. More formally, we can write

$$T(n) \xrightarrow{d} N(0,1).$$

  5. For any $p \in [0,1]$ and $n \in \mathbb{N}$, we define $t_{p,n}$ as the real value for which
$$P(T > t_{p,n}) = p.$$

Since the $t$-distribution has a symmetric PDF, we have

$$t_{1-p,n} = -t_{p,n}.$$
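As with the chi-squared quantiles, $t_{p,n}$ can be computed numerically; a short sketch with SciPy (illustrative values), also checking the symmetry relation:

```python
from scipy.stats import t

p, n = 0.025, 15
t_p_n = t.ppf(1 - p, df=n)       # value with P(T > t_p_n) = p
print(t_p_n)                     # about 2.131
print(t.ppf(p, df=n))            # t_{1-p,n} = -t_{p,n}, about -2.131
```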

Theorem Let $X_1, X_2, \cdots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ random variables. Also, let $S^2$ be the sample variance for this random sample. Then, the random variable $T$ defined as

$$T = \frac{\overline{X} - \mu}{S/\sqrt{n}}$$

has a $t$-distribution with $n-1$ degrees of freedom, i.e., $T \sim T(n-1)$.

Confidence Intervals for the Mean of Normal Random Variables

$(1-\alpha)100\%$ confidence interval

Assumptions: A random sample $X_1, X_2, X_3, \ldots, X_n$ is given from a $N(\mu,\sigma^2)$ distribution, where $\text{Var}(X_i) = \sigma^2$ is known.

Parameter to be Estimated: $\mu = E[X_i]$.

Confidence Interval: $\left[\overline{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$ is a $(1-\alpha)100\%$ confidence interval for $\mu$, where $z_{\alpha/2}$ is the real value for which $P(Z > z_{\alpha/2}) = \alpha/2$ with $Z \sim N(0,1)$.
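A minimal sketch of this interval (assuming NumPy/SciPy; the data are simulated purely for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma = 2.0                                   # known standard deviation
x = rng.normal(loc=5.0, scale=sigma, size=40)

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                   # z_{alpha/2}
half_width = z * sigma / np.sqrt(len(x))
print(x.mean() - half_width, x.mean() + half_width)   # 95% CI for mu
```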

$(1-\alpha)$ confidence interval

Assumptions: A random sample $X_1, X_2, \ldots, X_n$ is given from a $N(\mu,\sigma^2)$ distribution, where $\mu = E[X_i]$ and $\text{Var}(X_i) = \sigma^2$ are unknown.

Parameter to be Estimated: $\mu = E[X_i]$.

Confidence Interval: $\left[\overline{X} - t_{\alpha/2,n-1}\frac{S}{\sqrt{n}},\ \overline{X} + t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right]$ is a $(1-\alpha)$ confidence interval for $\mu$.
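When $\sigma$ is unknown we replace it with $S$ and use the $t$ quantile with $n-1$ degrees of freedom; a sketch under the same illustrative assumptions as above:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=40)

alpha = 0.05
s = x.std(ddof=1)                                    # sample standard deviation S
t_crit = t.ppf(1 - alpha / 2, df=len(x) - 1)         # t_{alpha/2, n-1}
half_width = t_crit * s / np.sqrt(len(x))
print(x.mean() - half_width, x.mean() + half_width)  # 95% CI for mu
```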

Confidence Intervals for the Variance of Normal Random Variables

Assumptions: A random sample $X_1, X_2, \ldots, X_n$ is given from a $N(\mu,\sigma^2)$ distribution, where $\mu = E[X_i]$ and $\text{Var}(X_i) = \sigma^2$ are unknown.

Parameter to be Estimated: $\text{Var}(X_i) = \sigma^2$.

Confidence Interval: $\left[\frac{(n-1)S^2}{\chi^2_{\alpha/2,n-1}},\ \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,n-1}}\right]$ is a $(1-\alpha)100\%$ confidence interval for $\sigma^2$.
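A sketch of the variance interval, again with simulated data (illustrative only):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
x = rng.normal(loc=5.0, scale=2.0, size=40)   # true sigma^2 = 4

alpha = 0.05
n = len(x)
s2 = x.var(ddof=1)                            # sample variance S^2
lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)   # divide by chi2_{alpha/2, n-1}
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)       # divide by chi2_{1-alpha/2, n-1}
print(lower, upper)                           # 95% CI for sigma^2
```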

Statistic

Let $X_1, X_2, \cdots, X_n$ be a random sample of interest. A statistic is a real-valued function of the data. For example, the sample mean, defined as

$$W(X_1,X_2,\cdots,X_n) = \frac{X_1 + X_2 + \cdots + X_n}{n},$$

is a statistic.

A test statistic is a statistic based on which we build our statistical test.

P-values

The P-value is the lowest significance level $\alpha$ that results in rejecting the null hypothesis.
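For instance (my own illustration, not from the text): in a $Z$-test of $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ with known $\sigma$, the P-value is the probability, under $H_0$, of a test statistic at least as extreme as the one observed. The data below are made up.

```python
import numpy as np
from scipy.stats import norm

x = np.array([5.2, 4.9, 5.8, 6.1, 5.4, 5.6])   # observed sample
mu0, sigma = 5.0, 0.5                          # hypothesized mean and known sigma

z_obs = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))
p_value = 2 * (1 - norm.cdf(abs(z_obs)))       # two-sided P-value
print(z_obs, p_value)                          # reject H0 at any alpha >= p_value
```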

Likelihood Ratio Test

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample from a distribution with a parameter $\theta$. Suppose that we have observed $X_1=x_1, X_2=x_2, \dots, X_n=x_n$. To decide between two simple hypotheses

\begin{align*}
H_0: \theta &= \theta_0, \\
H_1: \theta &= \theta_1,
\end{align*}

we define

$$\lambda(x_1,x_2,\cdots,x_n) = \frac{L(x_1,x_2,\cdots,x_n;\theta_0)}{L(x_1,x_2,\cdots,x_n;\theta_1)}.$$

To perform a likelihood ratio test (LRT), we choose a constant $c$. We reject $H_0$ if $\lambda < c$ and accept it if $\lambda \geq c$. The value of $c$ can be chosen based on the desired significance level $\alpha$.
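A minimal sketch of an LRT for two simple hypotheses about a normal mean with $\sigma$ known (my own example; the threshold $c$ here is arbitrary and would normally be chosen to achieve a desired $\alpha$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
sigma = 1.0
x = rng.normal(loc=0.4, scale=sigma, size=25)   # observed data

theta0, theta1 = 0.0, 0.5                       # H0: theta = theta0, H1: theta = theta1
# Likelihood ratio lambda = L(x; theta0) / L(x; theta1)
lam = np.prod(norm.pdf(x, loc=theta0, scale=sigma)) / \
      np.prod(norm.pdf(x, loc=theta1, scale=sigma))

c = 1.0                                         # illustrative threshold
print(lam, "reject H0" if lam < c else "accept H0")
```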

Simple Linear Regression

Given the observations $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$, we can write the regression line as

$$\hat{y} = \beta_0 + \beta_1 x.$$

We can estimate $\beta_0$ and $\beta_1$ as

\begin{align*}
\hat{\beta}_1 &= \frac{s_{xy}}{s_{xx}}, \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1\bar{x},
\end{align*}

where

\begin{align*}
s_{xx} &= \sum_{i=1}^n (x_i - \bar{x})^2, \\
s_{xy} &= \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).
\end{align*}

For each $x_i$, the fitted value $\hat{y}_i$ is obtained by

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.$$

The quantities

$$e_i = y_i - \hat{y}_i$$

are called the residuals.
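The estimates, fitted values, and residuals follow directly from these formulas; a short sketch with made-up data (assuming NumPy):

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))

beta1 = sxy / sxx                      # slope estimate
beta0 = y.mean() - beta1 * x.mean()    # intercept estimate

y_hat = beta0 + beta1 * x              # fitted values
residuals = y - y_hat
print(beta0, beta1)
print(residuals)
```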

Coefficient of Determination

For the observed data pairs $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$, we define the coefficient of determination, $r^2$, as

$$r^2 = \frac{s_{xy}^2}{s_{xx}s_{yy}},$$

where

$$s_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2, \quad
s_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2, \quad
s_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$$

We have $0 \leq r^2 \leq 1$. Larger values of $r^2$ generally suggest that our linear model

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

is a good fit for the data.
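Using the same made-up data as in the regression sketch above, $r^2$ follows directly from the three sums; it also equals the square of the sample correlation coefficient between $x$ and $y$.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))

r2 = sxy ** 2 / (sxx * syy)
print(r2)                              # close to 1 for this nearly linear data
print(np.corrcoef(x, y)[0, 1] ** 2)    # same value via the correlation coefficient
```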

Reference(s)

  • H. Pishro-Nik, Introduction to probability, statistics, and random processes. Kappa Research LLC, 2014.