8 Jan, 2016

Shrinkage & Regularization


OLS: Ordinary Least Squares. $\hat{\beta} = \text{argmin}_\beta \|y-X\beta\|^2 = (X'X)^{-1}X'y$
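A minimal sketch of the closed form in R, on simulated data (the `X` and `y` below are illustrative, not from the notes):

```r
set.seed(1)
X <- cbind(1, matrix(rnorm(100 * 3), 100, 3))  # design matrix with intercept column
y <- drop(X %*% c(1, 2, -1, 0.5)) + rnorm(100)

# (X'X)^{-1} X'y, computed with solve() rather than an explicit inverse
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Agrees with lm(), which solves the same problem via a QR decomposition
coef(lm(y ~ X - 1))
```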

Ridge Regression:

  • minimize $\|y-X\beta\|^2$ s.t. $\|\beta\|^2 \le s$
  • equivalently, minimize $\|y-X\beta\|^2 + \lambda\sum_j \beta_j^2$
  • $\tilde{\beta} = (X'X+\lambda I)^{-1}X'y$ (a closed-form sketch follows this list)
  • library(MASS)
  • lm.ridge(y~X, lambda=lambda_vector)
  • Choose $\lambda$ using cross-validation, based on $\text{MSE}_k = \frac{1}{n_k}\sum_{i \in \text{fold } k} (y_i-\hat{y}_i)^2$ (a CV sketch follows this list)
    • $\text{MSE} = \frac{1}{K}\sum_{k=1}^{K} \text{MSE}_k$
    • k-fold if $n$ is large
    • leave-one-out if $n$ is small
    • choose the $\lambda$ that minimizes $\text{MSE}$
  • Note that we don’t penalize the intercept: it is not attached to any predictor, and shrinking it would make the fit depend on where $y$ is centered.
  • Typical to standardize the predictors first, since the ridge penalty is not scale invariant.
  • Disadvantage: coefficients are shrunk toward zero but never set exactly to zero. So estimates tend to be smaller than the truth, and the model does not select variables.
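A minimal sketch of the closed-form estimator in R, on simulated, standardized data (everything named here is illustrative):

```r
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))  # standardize, as recommended above
y <- drop(X %*% rnorm(p)) + rnorm(n)
y <- y - mean(y)                        # center y so no intercept is needed

lambda <- 2
# ridge: (X'X + lambda I)^{-1} X'y; lambda = 0 recovers OLS
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
beta_ols   <- solve(crossprod(X), crossprod(X, y))
```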
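And a sketch of the k-fold selection of $\lambda$ with `lm.ridge`, again on simulated data (the fold count, grid, and data are all illustrative):

```r
library(MASS)

set.seed(2)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
df <- data.frame(y = drop(X %*% rnorm(p)) + rnorm(n), X)  # y is the first column

lambdas <- seq(0, 20, by = 0.5)
K <- 5
folds <- sample(rep(1:K, length.out = n))  # random fold assignment

cv_mse <- sapply(lambdas, function(lam) {
  mse_k <- sapply(1:K, function(k) {
    fit  <- lm.ridge(y ~ ., data = df[folds != k, ], lambda = lam)
    b    <- coef(fit)                      # intercept first, then slopes
    test <- df[folds == k, ]
    pred <- b[1] + as.matrix(test[, -1]) %*% b[-1]
    mean((test$y - pred)^2)                # MSE_k for this fold
  })
  mean(mse_k)                              # average over the K folds
})

best_lambda <- lambdas[which.min(cv_mse)]
```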

Lasso:

  • uses an $\ell_1$ penalty instead of $\ell_2$: minimize $\|y-X\beta\|^2 + \lambda\sum_j |\beta_j|$
  • $\|y-X\beta\|^2 = (\beta-\hat\beta)'X'X(\beta-\hat\beta) + y'(I-P_X)y$, where $P_X = X(X'X)^{-1}X'$ is the projection matrix. The quadratic form in $\beta$ means the contours of the RSS are ellipses centered at $\hat\beta$.

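Filling in the step behind that identity (standard algebra, not spelled out in the notes): write $y - X\beta = (y - X\hat\beta) + X(\hat\beta - \beta)$; the cross term vanishes because the residual is orthogonal to the columns of $X$:

$$\begin{aligned} \|y-X\beta\|^2 &= \|y-X\hat\beta\|^2 + (\hat\beta-\beta)'X'X(\hat\beta-\beta) \\ &= y'(I-P_X)y + (\beta-\hat\beta)'X'X(\beta-\hat\beta), \end{aligned}$$

using $X'(y-X\hat\beta) = X'y - X'X(X'X)^{-1}X'y = 0$.
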
  • Lasso vs. Ridge: the $\ell_1$ constraint region is a diamond with corners on the axes, while the $\ell_2$ region is a disk. The RSS ellipse tends to touch the diamond at a corner, which sets some coefficients exactly to zero; it touches the disk where no coordinate need be zero, so ridge only shrinks.
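A quick illustration of that difference in R. The notes don't name a lasso implementation; `glmnet` is one common choice (the data and settings below are illustrative):

```r
library(glmnet)

set.seed(3)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, -2, rep(0, p - 2))) + rnorm(n)  # sparse truth

# alpha = 1 is the lasso penalty, alpha = 0 is ridge; lambda chosen by CV
lasso_cv <- cv.glmnet(X, y, alpha = 1)
ridge_cv <- cv.glmnet(X, y, alpha = 0)

coef(lasso_cv, s = "lambda.min")  # several coefficients exactly zero
coef(ridge_cv, s = "lambda.min")  # all small but nonzero
```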