11 Jan, 2016
Shrinkage & Regularization
Convex function:
Sparsity of $\beta$: \(\norm{\beta}_0\) is the number of nonzero entries of $\beta$. This penalty is not convex, so it is difficult to optimize.
General penalty: $\norm{y-X\beta}^2 + \lambda\norm{\beta}_0$
In ridge regression, once $\lambda$ is known, the solution for $\beta$ is analytic; not so for the lasso. So we need iterative algorithms for estimating $\beta$.
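For reference, the standard ridge closed form (the lasso has no analogous expression):

\(\hat\beta_{\text{ridge}} = (X^\top X + \lambda I_p)^{-1} X^\top y\)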
- LARS (Efron, Hastie, Johnstone, Tibshirani): least angle regression, an algorithm that builds the solution path iteratively
- computationally as quick as forward / backward selection
- the angle between each predictor and the current residual is computed; predictors that are equally correlated with the residual have their coefficients updated together, based on that angle
- does not explore the entire model space
- R: `lars` package
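A minimal usage sketch of the `lars` package (the simulated `X` and `y` are illustrative, not from the course):

```r
library(lars)

## simulated data, purely for illustration
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- drop(X %*% c(3, -2, rep(0, 8)) + rnorm(100))

fit <- lars(X, y, type = "lar")   # least angle regression path
plot(fit)                         # coefficient paths
coef(fit)                         # coefficients at each step of the path
```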
- Coordinate descent algorithm
- $F(\beta) = \norm{y-X\beta}^2 + \lambda\psi(\beta)$, where $\psi$ is a convex function
- use a Gibbs-like scheme: optimize one coordinate at a time within each iteration. Pick starting values close to the truth, as with any other convex optimization routine (like Newton-Raphson); see the sketch after this list.
- R: `glmnet` package. `glmnet` is more stable than the `lars` package. Could be a package problem… but the two algorithms should give the same estimates. "Not stable" as in: with poor starting values or $p \gg n$, we get convergence problems. In any case, use coordinate descent.
- `cv.glmnet(X, y)` chooses $\lambda$ by cross-validation.
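A minimal sketch of coordinate descent for the lasso objective $\norm{y-X\beta}^2 + \lambda\norm{\beta}_1$ (the cyclic soft-thresholding update; `lasso_cd` and the simulated data are illustrative, not the course's code), followed by the `glmnet` calls the notes refer to:

```r
## soft-thresholding operator: S(z, t) = sign(z) * max(|z| - t, 0)
soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

## cyclic coordinate descent for  ||y - X b||^2 + lambda ||b||_1
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  p <- ncol(X)
  beta <- rep(0, p)                                  # starting values
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]  # partial residual
      beta[j] <- soft(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

## simulated data, purely for illustration
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- drop(X %*% c(3, -2, rep(0, 8)) + rnorm(100))
round(lasso_cd(X, y, lambda = 50), 3)

## glmnet: lasso path by coordinate descent, lambda chosen by cross-validation
library(glmnet)
fit   <- glmnet(X, y, alpha = 1)        # alpha = 1 gives the lasso penalty
cvfit <- cv.glmnet(X, y)                # k-fold CV over the lambda path
coef(cvfit, s = "lambda.min")           # coefficients at the CV-chosen lambda
```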
Elastic Net:
- Shortcomings of the lasso
- with the lasso, when $p > n$, at most $n$ of the $p$ variables can have non-zero coefficients.
- since the lasso objective function is not strictly convex, you may not be able to perform grouped variable selection: from a group of highly correlated predictors, the lasso tends to pick just one.
- penalty function: $J(\beta) = \lambda_2\norm{\beta}_2^2 + \lambda_1\norm{\beta}_1$ (the $l_2$ weight $\lambda_2$ and $l_1$ weight $\lambda_1$ match the derivation below)
- hybrid of $l_1$ and $l_2$ penalty
- strictly convex
- the $l_1$ part generates a sparse model; the $l_2$ part keeps the objective strictly convex, encourages a grouping effect, and removes the limitation on the number of selected variables.
- Goals: (1) grouped variable selection; (2) possibly select more predictors than the sample size
- goal 2:
\(\def\ds{\displaystyle}
\begin{array}{c}
\norm{y-X\beta}_2^2 + \lambda_2\norm{\beta}_2^2 + \lambda_1\norm{\beta}_1 \\
= \ds\norm{ {y \choose 0} - \frac{1}{\sqrt{1+\lambda_2}} {X \choose \sqrt{\lambda_2}\,I_p}\beta^* }_2^2 + \gamma\norm{\beta^*}_1 \\
\text{where } \beta^* = \sqrt{1+\lambda_2}\,\beta, \quad \gamma = \frac{\lambda_1}{\sqrt{1+\lambda_2}} \\
\text{Set } X^* = \frac{1}{\sqrt{1+\lambda_2}}{X \choose \sqrt{\lambda_2}\,I_p}
\end{array}\)
- the augmented problem has $n+p$ observations, so the lasso on $(X^*, y^*)$ can select up to $p$ predictors, i.e. all variables could be selected.
- The solution to this problem is called the naive elastic net estimate.
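A sketch of this augmented-data trick using `glmnet`'s lasso on $(X^*, y^*)$. The simulated data and the $\lambda_2$ value are illustrative; `glmnet`'s internal scaling and the CV choice of the $l_1$ penalty mean this only approximates an exact $(\lambda_1, \lambda_2)$ solution, and in practice `glmnet`'s `alpha` argument fits the elastic net directly.

```r
library(glmnet)

## simulated data with more predictors than observations, for illustration
set.seed(1)
n <- 30; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(rep(2, 5), rep(0, p - 5)) + rnorm(n))

lambda2 <- 1                              # l2 weight (illustrative value)

## augmented data from the derivation above:
##   y* = (y, 0),  X* = (1 + lambda2)^(-1/2) rbind(X, sqrt(lambda2) I_p)
y_star <- c(y, rep(0, p))
X_star <- rbind(X, sqrt(lambda2) * diag(p)) / sqrt(1 + lambda2)

## lasso on the augmented problem; map back via beta = beta* / sqrt(1 + lambda2)
cvfit     <- cv.glmnet(X_star, y_star, alpha = 1, intercept = FALSE)
beta_star <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]
beta_enet <- beta_star / sqrt(1 + lambda2)
sum(beta_enet != 0)                       # can exceed n, unlike the plain lasso
```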