7 Jan, 2016
Classical Inference Glossary
Bias: $E[\hat{\theta}] - \theta$
- Note that by construction, method of moments estimators are unbiased for the moments.
- Unbiasedness is not invariant to transformations.
- The variances can be used to compare two estimators with the same bias.
Consistency: An estimator $\hat\theta_n$ is consistent if $\hat\theta_n\overset{P}{\rightarrow}\theta$
- You can prove consistency typically in two ways:
- using the definition of consistency (checking convergence in probability)
- If $\lim E\brak{\hat\theta_n} = \theta$ and $\lim Var\brak{\hat\theta_n} = 0$, then $\hat\theta_n$ is consistent. That is, if the estimator is asymptotically unbiased and its variance goes to zero, then it is consistent.
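A quick example of the second route: the sample mean of iid data with $Var(X_i)=\sigma^2 \lt \infty$ is consistent for $\mu$, since
\[E\brak{\bar X_n} = \mu \quad\text{and}\quad Var\brak{\bar X_n} = \frac{\sigma^2}{n} \rightarrow 0 \quad\Rightarrow\quad \bar X_n \overset{P}{\rightarrow} \mu.\]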
Relative Efficiency: $RE(\hat\theta_n,\tilde\theta_n) = \frac{V(\hat\theta_n)}{V(\tilde\theta_n)}$.
Asymptotic Relative Efficiency: $ARE(\hat\theta_n,\tilde\theta_n) = \lim \frac{V(\hat\theta_n)}{V(\tilde\theta_n)}$.
Efficiency: An unbiased estimator is efficient if its variance attains the Cramer-Rao lower bound (defined below).
MSE: MSE($\hat\theta$) = E$\brak{\paren{\hat\theta -\theta}^2}$ = $\text{Bias}(\hat\theta)^2 + \text{Var}(\hat\theta)$
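The decomposition follows by adding and subtracting $E\brak{\hat\theta}$ inside the square; the cross term has expectation zero:
\[E\brak{\paren{\hat\theta -\theta}^2} = E\brak{\paren{\hat\theta - E\brak{\hat\theta}}^2} + \paren{E\brak{\hat\theta} - \theta}^2 = \text{Var}(\hat\theta) + \text{Bias}(\hat\theta)^2.\]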
Exponential Family: $p(x|\theta) = h(x)c(\theta)\exp\{\sum_{k=1}^K t_k(x)w_k(\theta)\}$, where $K$ is the number of parameters (i.e. the dimensions of $\theta$).
Natural Exponential Family:
$p(x|\eta) = h(x)\exp\{\sum_{k=1}^K \eta_k t_k(x) - \psi(\eta)\}$, with natural parameter $\eta = (\eta_1,\dots,\eta_K)$ and sufficient statistic $t(x) = (t_1(x),\dots,t_K(x))$.
Convergence in Probability: $\forall \epsilon \gt 0, \lim_{n\rightarrow\infty} P(\abs{X_n-X}>\epsilon) = 0$. Usually, convergence in probability can be proved by Chebyshev's inequality. Convergence in probability always implies convergence in distribution.
- Note: $X_n \overset{P}{\rightarrow} X \Rightarrow h(X_n) \overset{P}{\rightarrow} h(X)$, for $h(\cdot)$ a continuous function.
Chebyshev's Inequality (general form): for a nonnegative function $g$ and any $r \gt 0$, $\displaystyle P( g(X) \ge r)~\le~\frac{E\brak{g(X)}}{r}$. Taking $g(x) = (x-\mu)^2$ and $r = t^2$ gives the familiar form $P(\abs{X-\mu}\ge t) \le \frac{Var(X)}{t^2}$.
Convergence in Distribution: $\lim_{n\rightarrow\infty} F_n(x) = F(x)$ at every point $x$ where $F$ is continuous. Convergence in distribution to a constant implies convergence in probability to the same constant.
Weak Law of Large Numbers (WLLN): $\bar X_n \overset{P}{\rightarrow} \mu$
CLT: $\frac{\bar X_n - \mu}{\sigma/\sqrt n} \overset{D}{\rightarrow} N(0,1)$
Slutsky’s Theorem: If $X_n \overset{D}{\rightarrow} X$ and $ Y_n \overset{D}{\rightarrow} a$, then
- $Y_n X_n \overset{D}{\rightarrow} aX$
- $X_n + Y_n \overset{D}{\rightarrow} X + a$
Delta Method: If $\sqrt{n}[Y_n-\theta] \overset{D}{\rightarrow} N(0,\sigma^2)$ then
\(\sqrt{n}[g(Y_n)-g(\theta)] \overset{D}{\rightarrow} N(0,\sigma^2[g'(\theta)]^2)\)
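For example, taking $g(y)=\log y$ (assuming $\mu \gt 0$) so that $g'(\mu) = 1/\mu$, the CLT above gives
\[\sqrt{n}\brak{\log \bar X_n - \log \mu} \overset{D}{\rightarrow} N\paren{0, \frac{\sigma^2}{\mu^2}}.\]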
Data Reduction:
Sufficient Statistic: $T(x)$ is sufficient for $\theta$ iff $\forall x$ in the sample space, $p(x|T(x)) = \frac{p(x|\theta)}{q(T(x)|\theta)}$ does not depend on $\theta$. (Not unique)
Minimal Sufficient Statistic (mss): An mss $T(x)$ can be written as a function of any other sufficient statistic $T'(x)$.
- Find mss using this important theorem: If $\frac{f(x|\theta)}{f(y|\theta)}$ is constant as a function of $\theta$ iff $T(x) = T(y)$, then $T(x)$ is mss.
- Use this to find mss for all distributions, as practice. (Remember to prove both directions)
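As one worked case, for an iid $N(\mu,\sigma^2)$ sample,
\[\frac{f(x|\mu,\sigma^2)}{f(y|\mu,\sigma^2)} = \exp\brac{-\frac{1}{2\sigma^2}\paren{\sum_i x_i^2 - \sum_i y_i^2} + \frac{\mu}{\sigma^2}\paren{\sum_i x_i - \sum_i y_i}},\]
which is constant in $(\mu,\sigma^2)$ iff $\paren{\sum_i x_i, \sum_i x_i^2} = \paren{\sum_i y_i, \sum_i y_i^2}$, so $\paren{\sum_i x_i, \sum_i x_i^2}$ (equivalently $(\bar x, s^2)$) is an mss.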
Ancillary Statistics: A statistic $S(x)$ whose distribution does not depend on the parameter of interest $\theta$ is an ancillary statistic. (No shortcuts, must calculate the pdf and show that it is constant w.r.t. $\theta$.)
- The range is ancillary to the location parameter in location families
- The ratio of two random variables in a scale family is ancillary to the scale parameter
- Ancillary and sufficient statistics are not always independent
Complete Statistic: Let $f(t|\theta)$ be a family of pdfs or pmfs for a statistic $T(x)$ (i.e., $T(x)$ is a statistic / transformation of the data: mean, sd, etc.). If $E_{t|\theta}[g(T)] = 0 $ for all $\theta$ implies that $P(g(T)=0) = 1$ for all $\theta$, then $T(x)$ is a complete statistic.
- a complete sufficient statistic is also a minimal sufficient statistic (provided an mss exists)
- In the exponential family, $T(X) = \paren{\sum_i t_1(X_i),\dots,\sum_i t_K(X_i)}$ is complete, as long as the natural parameter space $\brac{(w_1(\theta),\dots,w_K(\theta)): \theta\in\Theta}$ contains an open set in $\mathcal{R}^K$ (this fails for curved exponential families, where $\dim(\theta) \lt K$)
- Note: $\frac{d}{d\theta} \int_0^\theta~g(t)~dt = \frac{d}{d\theta}\brak{G(\theta) - G(0)} = g(\theta)$, if $g$ is continuous (fundamental theorem of calculus).
Basu’s Theorem: Complete mss are independent of every ancillary statistic.
- Example 6.2.26 is an interesting example
Factorization Theorem: $f(x|\theta) = g(t|\theta) h(x)$ iff $T(x)$ is sufficient for $\theta$.
- Exercise: pick a distribution and find the sufficient statistic by identifying $g(.),h(.)$
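For example, for an iid Poisson($\lambda$) sample,
\[f(x|\lambda) = \prod_{i=1}^n \frac{\lambda^{x_i}e^{-\lambda}}{x_i!} = \underbrace{\lambda^{\sum_i x_i}\, e^{-n\lambda}}_{g(T(x)|\lambda)}\;\underbrace{\frac{1}{\prod_i x_i!}}_{h(x)},\]
so $T(x) = \sum_i x_i$ is sufficient for $\lambda$.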
Hessian: In computing the MLE, the second-derivative check in the bivariate case uses the Hessian matrix $H$: the matrix of second derivatives of the log-likelihood. Evaluate $H=H(\hat\mu,\hat\phi)$ and check that it has all negative eigenvalues (i.e. $H$ is negative definite). The same check applies in higher dimensions.
\(H =
\left(
\begin{matrix}
l_{\mu\mu} & l_{\mu\phi} \\\\
l_{\phi\mu} & l_{\phi\phi} \\\\
\end{matrix}
\right)\)
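A minimal numerical sketch of this check, using a normal log-likelihood in $(\mu, \phi = \log\sigma)$ and a finite-difference Hessian (the model choice and helper names here are just illustrative):

```python
import numpy as np

def loglik(params, x):
    """Normal log-likelihood with parameters (mu, phi = log sigma)."""
    mu, phi = params
    sigma = np.exp(phi)
    return np.sum(-0.5 * np.log(2 * np.pi) - phi - 0.5 * ((x - mu) / sigma) ** 2)

def numerical_hessian(f, params, eps=1e-5):
    """Central finite-difference Hessian of f at params."""
    p = np.asarray(params, dtype=float)
    k = len(p)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            pp = p.copy(); pp[i] += eps; pp[j] += eps
            pm = p.copy(); pm[i] += eps; pm[j] -= eps
            mp = p.copy(); mp[i] -= eps; mp[j] += eps
            mm = p.copy(); mm[i] -= eps; mm[j] -= eps
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * eps ** 2)
    return H

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)
mle = np.array([x.mean(), np.log(x.std())])   # closed-form MLE of (mu, log sigma)
H = numerical_hessian(lambda p: loglik(p, x), mle)
print(np.linalg.eigvalsh(H))                  # all negative => local maximum
```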
Invariance of MLE: The MLE of a function of a parameter is that function of the MLE of the parameter: $\widehat{\tau(\theta)} = \tau(\hat\theta)$.
Satterthwaite Approximation: Approximating $Z = \sum a_i Y_i$, where $Y_i \sim \chi^2_{r_i}$.
- Approx with $Z \approx \chi^2_{\nu}/\nu$
- $E(Z) = \sum a_i r_i \approx E(\chi^2_\nu/\nu) = 1$
- find $E(Z^2)$ then solve for $\nu$.
- Finally, plugging the observed value of $Z^2$ in for $E(Z^2)$ gives $\tilde\nu = \frac{2}{(\sum a_i Y_i)^2-1}$.
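Filling in the second-moment step: since $E(\chi^2_\nu) = \nu$ and $Var(\chi^2_\nu) = 2\nu$,
\[E\brak{\paren{\frac{\chi^2_\nu}{\nu}}^2} = \frac{Var(\chi^2_\nu)}{\nu^2} + 1 = \frac{2}{\nu} + 1, \quad\text{so matching } E(Z^2) \text{ gives } \nu = \frac{2}{E(Z^2)-1}.\]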
Statistics Joke:
Three statisticians go hunting in a forest. The first one shoots one meter to the left; the other shoots one meter to the right. The third statistician says, “we got it!”
- UMVUE
- Uniformly Minimum Variance Unbiased Estimator
- Cramer-Rao Lower Bound
- Let \(x_1,...,x_n\) be a sample with pdf \(f(\pmb x|\theta)\) and let
$W(\pmb x)$ be any function satisfying
- \(\frac{d}{d\theta} \int_\mathcal{X}~W(x)f(x|\theta) ~dx = \int_\mathcal{X}~\frac{d}{d\theta}W(x)f(x|\theta) ~dx\) (differentiation under the integral sign)
- \(Var(W(x)) \lt \infty\)
then,
\[Var(W(x)) \ge \frac{ \brak{\frac{d}{d\theta}E_{x|\theta}(W(x))}^2 }{E_{x|\theta}\paren{\paren{\frac{d}{d\theta}\log f(x|\theta)}^2}}\]
- Rao-Blackwell
- Let $W = W(x)$ be an unbiased estimator of $\tau(\theta)$, and let $T$ be a sufficient statistic for $\theta$. Define $\phi(T) = E(W|T)$. Then $E(\phi(T)) = \tau(\theta)$ and $Var(\phi(T)) \le Var(W)$ for all $\theta$.
- Theorem
- UMVUE’s are unique
- Theorem
- Let $T$ be a complete sufficient statistic for $\theta$, and let $\phi(T)$ be any estimator based only on $T$. Then $\phi(T)$ is the unique best unbiased estimator of its expected value.
- Theorem (for checking an estimator is not UMVUE)
- If $E_{x|\theta}(W)=\tau(\theta)$, then $W$ is the best unbiased estimator for $\tau(\theta)$ iff $W$ is uncorrelated with all unbiased estimators of 0 (i.e. $Cov(W,W^*)=0$ for every $W^*$ with $E[W^*]=0$, for all $\theta$).
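A standard example tying these together: for an iid Poisson($\lambda$) sample, $T=\sum_i X_i$ is a complete sufficient statistic (exponential family), and $\bar X = T/n$ is unbiased for $\lambda$, so $\bar X$ is the UMVUE of $\lambda$. It also attains the Cramer-Rao bound, since $\frac{d}{d\lambda}\log f(x|\lambda) = \frac{\sum_i x_i}{\lambda} - n$ and
\[E_{x|\lambda}\paren{\paren{\frac{\sum_i X_i}{\lambda} - n}^2} = \frac{Var\paren{\sum_i X_i}}{\lambda^2} = \frac{n}{\lambda}, \quad\text{so the bound is } \frac{\lambda}{n} = Var(\bar X).\]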
EM (Expectation-Maximization) Algorithm
- Let $Y$ = “incomplete” data (observed)
- Let $X$ = “augmented” data (unobserved)
- Let $(X,Y)$ = “complete” data
- Then $p(x,y | \lambda)$ is the complete-data likelihood
- And $p(x | y,\lambda) = \frac{p(x,y | \lambda)}{p(y | \lambda)}$
- Start with some $\lambda^{(0)}$, and sequentially compute: $\lambda^{(i)} = \underset{\lambda}{argmax}~ E[\log p(x,y | \lambda) ~|~ \lambda^{(i-1)}, y]$
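A minimal numpy sketch of these updates for a two-component Gaussian mixture with known unit variances, where the unobserved component labels play the role of $X$ (the model and variable names are illustrative, not from the notes):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# y: observed data; the component labels x are the unobserved "augmented" data
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 200)])

pi, mu1, mu2 = 0.5, -1.0, 1.0          # lambda^(0): initial guesses
for i in range(100):
    # E-step: posterior probability that each point belongs to component 2
    w = pi * norm.pdf(y, mu2, 1)
    w = w / (w + (1 - pi) * norm.pdf(y, mu1, 1))
    # M-step: maximize E[log p(x, y | lambda) | lambda^(i-1), y]
    pi = w.mean()
    mu1 = np.sum((1 - w) * y) / np.sum(1 - w)
    mu2 = np.sum(w * y) / np.sum(w)

print(pi, mu1, mu2)   # roughly 0.4, 0, 4 for this simulated data
```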
Hypothesis testing
- Likelihood Ratio statistic
- \(\lambda(x) = \frac{\underset{\theta\in\Theta_0}{sup} L(\theta;x)}{\underset{\theta\in\Theta}{sup} L(\theta;x)}\)
reject when $\lambda \lt c$ where $c$ is small
- Power Function
- The power function of a hypothesis test with rejection region $R$ is the function of the parameter $\theta$ defined by \(\beta(\theta) = P_\theta(X\in R)\). For a size-$\alpha$ test with simple null \(\theta = \theta_0\), \(\beta(\theta_0) = \alpha\). Yes, confusing nomenclature.
|               | Reject Null  | Accept Null   |
|---------------|--------------|---------------|
| Null is True  | Type I error | ok            |
| Null is False | ok           | Type II error |
Type I error \(=\beta(\theta_0) = \alpha\)
Type II error probability at an alternative \(\theta_1\) is \(1-\beta(\theta_1) = 1-\gamma\), where \(\gamma\) is the power at \(\theta_1\); its worst case over the alternative involves the infimum of the power, \(1 - \underset{\theta\in\Theta_0^C}{\text{inf}}~\beta(\theta)\).
- Most Powerful Test
- Let $G$ be a class of tests for testing \(H_0:\theta\in\Theta_0\) vs \(H_a:\theta\in\Theta_0^C\). A test in class $G$ with power function $\beta(\theta)$ is a uniformly most powerful (UMP) class $G$ test if $\beta(\theta) \ge \beta'(\theta)$ for every $\theta\in\Theta_0^C$ and every $\beta'(\theta)$ that is a power function of a test in class $G$.
- Neyman-Pearson lemma (simple-simple)
- Consider testing \(H_0: \theta=\theta_0\) vs \(H_1: \theta=\theta_1\) where the pdf or pmf is $f(x | \theta_i)$, $i=0,1$, using a test with rejection region
\[\begin{array}{lcl}
X \in R &\text{if}& f(x|\theta_1) \gt kf(x|\theta_0) \\
X \in R^C &\text{if}& f(x|\theta_1) \le kf(x|\theta_0) \\
\end{array}\]
for some ($k\ge0$) such that $\alpha = \beta(\theta_0)$. Then
- Sufficiency: Any test that satisfies this is a UMP level $\alpha$ test.
- Necessity: If there is a test satisfying these conditions with $k\gt 0$, then every UMP level $\alpha$ test is a size $\alpha$ test, and every UMP level $\alpha$ test satisfies the structure of the rejection region except perhaps on a set $A$ satisfying \(P_{\theta_0}(X\in A) = P_{\theta_1}(X\in A) = 0\).
- Monotone Likelihood Ratio tests
- A family of pdfs or pmfs $\brac{g(t|\theta): \theta\in\Theta}$ for a univariate random variable $T$ with real valued parameter $\theta$ has a monotone likelihood ratio (MLR) if $\forall(\theta_2 \gt \theta_1)$, $\frac{g(t|\theta_2)}{g(t|\theta_1)}$ is a monotone function of $t$.
- Karlin-Rubin Theorem
- Consider testing \(H_0: \theta \le \theta_0\) vs \(H_1: \theta \gt \theta_0\). Suppose that $T$ is a sufficient statistic for $\theta$ and the family of pdfs or pmfs $\brac{g(t|\theta): \theta\in\Theta}$ of $T$ has an MLR. Then for any $t_0$, the test that rejects $H_0$ iff $T\gt t_0$ is a UMP level $\alpha$ test where $\alpha = P_{\theta_0}(T\gt t_0)$
- p-value
- A p-value $p(x)$ is a test statistic satisfying $0\le p(x) \le 1$ for every sample $x$. Small values of $p(x)$ give evidence that $H_1$ is true. A p-value is called valid if for every $\theta \in \Theta_0$ and every $0 \le \alpha \le 1$, $P_{\theta}(p(X) \le \alpha) \le \alpha$
When the distribution of the test statistic is not available in closed form, we can
- use asymptotics
- permutation test
- Wilks' Theorem
- For testing $H_0: \theta = \theta_0$ vs $H_1: \theta \ne \theta_0$, suppose $x_1,\dots,x_n$ are iid from $f(x|\theta)$, $\hat\theta$ is the MLE, and $f$ satisfies regularity conditions. Under the null hypothesis, as $n \rightarrow \infty$,
\[-2 \log(\lambda) \overset{D}{\rightarrow} \chi_1^2\]
More generally, if $H_0: \theta \in \Theta_0$ vs $H_1: \theta \in \Theta_0^C$, then
\[-2 \log(\lambda) \overset{D}{\rightarrow} \chi_d^2\]
where $d=\dim(\Theta) - \dim(\Theta_0)$
- Permutation test
- if under the null the distribution of the data is invariant under the action of permutations, the distribution of statistics can be easily computed.
Example
$ X_1,…,X_n \sim F_1$ and $ Y_1,…,Y_n \sim F_2$. $H_0: F_1 = F_2$ vs $H_1: F_1 \ne F_2$.
We can test this by randomly relabelling the pooled sample and checking whether a statistic of the difference between the two groups, $Z = g(X) - g(Y)$ (e.g. a difference of means), computed on the observed labelling is extreme relative to its values under the relabellings.
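A minimal sketch with the difference of means as the statistic $Z$ (the simulated data and function name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 30)   # sample from F1
y = rng.normal(0.5, 1.0, 30)   # sample from F2

def perm_pvalue(x, y, n_perm=10_000, rng=rng):
    """Two-sided permutation p-value for the difference of means."""
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        relabeled = rng.permutation(pooled)   # relabel under H0: F1 = F2
        z = relabeled[:len(x)].mean() - relabeled[len(x):].mean()
        count += abs(z) >= abs(observed)
    return count / n_perm

print(perm_pvalue(x, y))
```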
- Wald Likelihood Test
- Under large samples, we can test the two-sided hypothesis $H_0: \theta=\theta_0$ vs. $H_1: \theta\ne\theta_0$ (or the corresponding one-sided hypotheses) with the statistic
\[Z_n = \frac{\hat\theta-\theta_0}{\sqrt{Var\p{\hat\theta}}} = \frac{W_n - \theta_0}{S_n}\]
Note that when $W_n$ is the MLE and $S_n = \frac{1}{\sqrt{I(W_n)}}$, $Z_n \overset{D}{\rightarrow} N(0,1)$. We can also substitute $\sqrt{Var(\hat\theta)}$ with $se(\hat\theta)$. Using $se(\hat\theta)$ is often easier when inverting a test, and that is what is used most often in intro stats courses. The Wald test is an approximate test. Using a Wald test and inverting it to get a confidence interval may show up in the first year exam. (Try doing this for all exponential family distribution models.)
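As a concrete instance of inverting the Wald test: for iid Bernoulli($p$), the MLE is $\hat p = \bar X$ with $se(\hat p) = \sqrt{\hat p(1-\hat p)/n}$, and collecting the values $p_0$ with $\abs{Z_n} \le z_{\alpha/2}$ gives the familiar interval
\[\hat p \pm z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}.\]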
- Interval Estimation
- An interval estimate of a real-valued parameter $\theta$ is any pair of functions $L(X)$ and $U(X)$ of a sample that satisfies $L(X) \lt U(X)$ for all $X$ in the sample space. If $X$ is observed the implied inference is that $L(X) \lt \theta \lt U(X)$. Note that here, $X$ is a random variable.
- Confidence Coefficient
- $\underset{\theta}{\text{inf}}~P_\theta(\theta \in (L(X), U(X)) )$
- Pivot
- Intervals based on a pivotal quantity: a function $Q(X,\theta)$ of the data and the parameter whose distribution does not depend on the parameters of the model (e.g. $\frac{\bar X - \mu}{S/\sqrt{n}}$ for normal data).
- Test Inversion
- Invert a family of level-$\alpha$ tests: the confidence set is the set of hypothesized values $\theta_0$ that the test does not reject. Because each test controls its error rate, the resulting interval has the desired coverage.
Under regularity conditions (see below), if $x_i$ are iid $f(x_i\v\theta)$ and $\hat{\theta}_n$ are the MLE’s for $\theta$ then
\[\frac{\hat\theta - \theta}{\sqrt{Var(\hat\theta)}} \overset{D}{\rightarrow} N(0,1) ~~ \text{asymptotically Normal}\]
and
\[\hat\theta \overset{P}{\rightarrow} \theta ~~ \text{consistent}\]
Regularity Conditions
1. iid observations
2. Identifiability of parameters (i.e. $\theta \ne \theta' \Rightarrow f(x\v\theta) \ne f(x\v\theta')$ for some $x$, so distinct parameters give distinct distributions)
3. $f(x\v\theta)$ has the same support for every $\theta$ (the support does not depend on the parameter) and is differentiable in $\theta$.
4. The parameter space contains an open set $\omega$ of which the true value of the parameter $\theta_0$ is an interior point (the true parameter is not on the boundary of the parameter space).
5. For every $x$ the density $f(x\v\theta)$ is three times differentiable w.r.t. $\theta$, the third derivative is continuous in $\theta$, and $\int f(x\v\theta)~dx$ can be differentiated three times under the integral sign.
6. For any $\theta_0$ there exists a positive number $c$ and a function $M(x)$ (both of which can depend on $\theta_0$) such that $\abs{\frac{\partial^3}{\partial\theta^3}\log f(x\v\theta)} \le M(x)$ for $\theta_0-c \lt \theta \lt \theta_0+c$, with $E_{x\v\theta_0}(M(x)) \lt \infty$
Consistency requires 1-4. Normality requires 1-6.
- Likelihood Profile
- A likelihood of the form $L(\theta_1,\hat\theta_0(\theta_1))$, where $\hat\theta_0(\theta_1)$ are the MLEs of the nuisance parameters $\theta_0$, obtained in closed form for each fixed value of the parameters of interest $\theta_1$ (for which the MLEs cannot be found in closed form).
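A small sketch for a Gamma(shape $\alpha$, scale $\beta$) sample, where for fixed $\alpha$ the MLE $\hat\beta(\alpha) = \bar x/\alpha$ is available in closed form but $\hat\alpha$ is not; the model and the grid search over $\alpha$ are just for illustration:

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import gammaln

rng = np.random.default_rng(3)
x = gamma.rvs(a=2.5, scale=1.8, size=400, random_state=rng)

def profile_loglik(alpha, x):
    """log L(alpha, beta_hat(alpha)) with beta_hat(alpha) = xbar / alpha."""
    beta = x.mean() / alpha
    return np.sum((alpha - 1) * np.log(x) - x / beta - alpha * np.log(beta) - gammaln(alpha))

alphas = np.linspace(0.5, 6.0, 200)
prof = np.array([profile_loglik(a, x) for a in alphas])
alpha_hat = alphas[prof.argmax()]
print(alpha_hat, x.mean() / alpha_hat)   # approximate MLEs of (alpha, beta)
```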
Parametric Bootstrap
For a likelihood $f$ with MLE $\hat\theta_{MLE}$ computed from the observed data:
- For $b = 1:B$
- Draw $y^{(b)} \sim f(y\v\hat\theta_{MLE})$
- Compute $\hat\theta_{MLE}(y^{(b)})$
A sketch of both bootstrap schemes is given after the nonparametric version below.
Nonparametric Bootstrap
Sample data of size $n$ with replacement, fit model, get estimates.
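A minimal numpy sketch of both bootstrap schemes, assuming (purely for illustration) an exponential model with its rate as the parameter of interest:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.exponential(scale=1 / 3.0, size=200)   # observed data, true rate 3
rate_mle = 1 / y.mean()                        # MLE of the exponential rate
B = 2000

# Parametric bootstrap: simulate from the fitted model f(y | theta_hat_MLE)
par_boot = np.array([
    1 / rng.exponential(scale=1 / rate_mle, size=len(y)).mean() for _ in range(B)
])

# Nonparametric bootstrap: resample the data with replacement, refit
npar_boot = np.array([
    1 / rng.choice(y, size=len(y), replace=True).mean() for _ in range(B)
])

print(rate_mle, par_boot.std(), npar_boot.std())   # point estimate and two SE estimates
```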
Possible Projects
- Comparing Newton-Raphson, SGD, and GD for finding the MLE of a (generalized) linear model
- speed and difficulty
- review EM