8 Jan, 2016
Shrinkage & Regularization
OLS: Ordinary Least Squares. $\hat{\beta} = \operatorname{argmin}_\beta \|y-X\beta\|^2 = (X'X)^{-1}X'y$
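A minimal sketch of the closed form in R, assuming a response vector y and a design matrix X whose first column is the intercept:
solve(t(X) %*% X, t(X) %*% y)  # solves the normal equations X'X b = X'y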
Ridge Regression:
- minimize $\|y-X\beta\|^2$ s.t. $\|\beta\|^2 \le s$
- minimize $\|y-X\beta\|^2 + \lambda\sum_j \beta_j^2$
- $\tilde{\beta} = (X'X+\lambda I)^{-1}X'y$ (sketched below, after the lm.ridge call)
library(MASS)
fit <- lm.ridge(y ~ X, lambda = lambda_vector)  # lambda_vector: a grid of candidate penalty values
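The closed form itself is also easy to compute directly. A sketch, assuming X is standardized, y is centered (so no intercept), and lambda is a single penalty value; note that lm.ridge standardizes internally, so its reported coefficients may differ from this raw formula:
p <- ncol(X)
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)  # (X'X + lambda I)^{-1} X'y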
- Choose $\lambda$ using cross-validation, based on the per-fold error $\text{MSE}_k = \frac{1}{n_k}\sum_{i \in \text{fold } k} (y_i-\hat{y}_i)^2$
  - Overall CV error: $\text{MSE} = \frac{1}{K}\sum_{k=1}^{K} \text{MSE}_k$
  - k-fold if $n$ is large
  - leave-one-out if $n$ is small
  - choose the $\lambda$ that minimizes MSE (see the sketch below)
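A k-fold cross-validation sketch in R, under the same assumptions as above (X standardized, y centered, no intercept) and a hypothetical grid of penalties:
K <- 10
set.seed(1)
folds <- sample(rep(1:K, length.out = nrow(X)))  # random fold assignment
lambdas <- seq(0, 10, by = 0.1)                  # hypothetical candidate grid
cv_mse <- sapply(lambdas, function(l) {
  mean(sapply(1:K, function(k) {
    tr <- folds != k
    b  <- solve(t(X[tr, ]) %*% X[tr, ] + l * diag(ncol(X)), t(X[tr, ]) %*% y[tr])
    mean((y[!tr] - X[!tr, ] %*% b)^2)            # MSE_k on the held-out fold
  }))
})
best_lambda <- lambdas[which.min(cv_mse)]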
- Note that we don’t penalize the intercept: it is not the coefficient of any predictor, it just absorbs the mean of $y$.
- It is typical to standardize the predictors first, since the ridge solution is not scale invariant (see the sketch below).
- Disadvantage: coefficients are shrunk toward zero, but never exactly to zero. So coefficients tend to be smaller than the truth, and the model does not select variables.
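For the standardization point above, a minimal sketch in base R (scale() centers each column to mean 0 and scales it to sd 1; centering y lets us drop the unpenalized intercept):
X_std <- scale(X)      # each predictor: mean 0, sd 1
y_ctr <- y - mean(y)   # centered response; the intercept is recovered as mean(y)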
Lasso: