11 Jan, 2016

Shrinkage & Regularization


Convex function: $f$ is convex if $f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y)$ for all $x, y$ and $\alpha \in [0,1]$. Convex objectives are the ones we can reliably optimize.

Sparsity of $\beta$: \(\norm{\beta}_0\) is the number of nonzero components of $\beta$. This is not convex, so it is difficult to optimize.

General ($l_0$) penalty: $\norm{y-X\beta}^2 + \lambda\norm{\beta}_0$

In ridge regression, once $\lambda$ is fixed, the solution for $\beta$ is analytic: $\hat\beta = (X^TX + \lambda I)^{-1}X^Ty$. Not so in the lasso, so we need iterative algorithms for estimating $\beta$.
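A minimal R sketch of this contrast (simulated data; names like `beta_ridge` are my own, not from the lecture):

```r
set.seed(1)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(3, -2, rep(0, p - 2)) + rnorm(n)

# Ridge: for fixed lambda, the minimizer of ||y - X b||^2 + lambda ||b||^2
# is analytic: beta_hat = (X'X + lambda I)^{-1} X'y.
lambda <- 0.5
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# Lasso: no closed form; an iterative algorithm (LARS, coordinate descent) is needed.
```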

  • LARS (Efron, Hastie, Johnstone, Tibshirani): least angle regression, an iterative algorithm
    • computationally as quick as forward / backward selection
    • the angle between each predictor and the current residual is computed; predictors equally correlated with the residual have their coefficients updated together
    • does not explore entire model space
    • R: lars package
  • Coordinate descent algorithm
    • $F(\beta) = \norm{y-X\beta}^2 + \lambda\psi(\beta)$, where $\psi$ is a convex function
    • optimize one coordinate of $\beta$ at a time, cycling through coordinates in a Gibbs-sampler-like fashion. Pick starting values close to the truth, as with any other convex optimizer (e.g. Newton-Raphson).
    • R: glmnet package. glmnet is more stable than the lars package. This could be an implementation issue, since the two algorithms should give the same estimates; "unstable" here means that with poor starting values or $p \gg n$ we get convergence problems. In any case, use coordinate descent.
    • cv.glmnet(X, y) chooses $\lambda$ by cross-validation (see the sketch after this list)
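As a concrete illustration, here is a minimal coordinate descent sketch for the lasso objective $F(\beta) = \norm{y-X\beta}^2 + \lambda\norm{\beta}_1$, followed by the glmnet route (`soft` and `lasso_cd` are illustrative names, not package functions; `X` and `y` are from the sketch above):

```r
# Soft-thresholding: the closed-form solution of each one-coordinate subproblem.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for F(beta) = ||y - X beta||^2 + lambda ||beta||_1.
lasso_cd <- function(X, y, lambda, n_sweeps = 100) {
  beta <- rep(0, ncol(X))                            # start at zero
  for (sweep in seq_len(n_sweeps)) {
    for (j in seq_len(ncol(X))) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]  # partial residual
      # Minimize ||r_j - x_j b||^2 + lambda |b| in closed form.
      beta[j] <- soft(crossprod(X[, j], r_j), lambda / 2) / crossprod(X[, j])
    }
  }
  beta
}

beta_lasso <- lasso_cd(X, y, lambda = 10)

# The optimized implementation (note: glmnet scales the loss by 1/(2n),
# so its lambda is on a different scale):
library(glmnet)
cvfit <- cv.glmnet(X, y)          # choose lambda by cross-validation
coef(cvfit, s = "lambda.min")
```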

Elastic Net:

  • Shortcomings of the lasso
    • when $p > n$, the lasso selects at most $n$ of the $p$ variables.
    • since the lasso objective is not strictly convex, it cannot perform group variable selection: among strongly correlated predictors it tends to pick one arbitrarily.
  • penalty function: $J(\beta) = \lambda_2\norm{\beta}_2^2 + \lambda_1\norm{\beta}_1$
    • hybrid of $l_1$ and $l_2$ penalty
    • strictly convex
    • the $l_1$ part generates a sparse model; the $l_2$ part makes the objective strictly convex, encourages a grouping effect, and removes the limit on the number of selected variables
    • Goals: (1) group variable selection; (2) possibly select more predictors than the sample size
      • goal 2: the elastic net problem reduces to a lasso on augmented data: \(\def\ds{\displaystyle} \begin{array}{c} \norm{y-X\beta}_2^2 + \lambda_2\norm{\beta}_2^2 + \lambda_1\norm{\beta}_1 \\ = \ds\norm{ {y \choose 0} - \frac{1}{\sqrt{1+\lambda_2}} {X \choose \sqrt{\lambda_2}\,I_p}\beta^* }_2^2 + \gamma\norm{\beta^*}_1, \\ \text{where } \beta^* = \sqrt{1+\lambda_2}\,\beta, \quad \gamma = \frac{\lambda_1}{\sqrt{1+\lambda_2}}. \\ \text{Set } X^* = \frac{1}{\sqrt{1+\lambda_2}}{X \choose \sqrt{\lambda_2}\,I_p}. \end{array}\)
      • the augmented design $X^*$ has $n+p$ rows, so the lasso on it can select up to $n+p$ coefficients; in particular, all $p$ variables could be selected.
      • The solution $\hat\beta^*$ of this lasso problem, rescaled as $\hat\beta = \hat\beta^*/\sqrt{1+\lambda_2}$, is called the naive elastic net estimate (see the sketch below).
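A sketch of the augmented-data route (reusing `lasso_cd` from above; `lambda1`, `lambda2` are illustrative values), next to glmnet's built-in elastic net, where `alpha` between 0 and 1 mixes the $l_1$ and $l_2$ penalties:

```r
lambda1 <- 1; lambda2 <- 0.5
p <- ncol(X)

# Augmented data: X* has n + p rows, so the lasso on it can keep all p variables.
X_star <- rbind(X, sqrt(lambda2) * diag(p)) / sqrt(1 + lambda2)
y_star <- c(y, rep(0, p))
gamma  <- lambda1 / sqrt(1 + lambda2)

beta_star       <- lasso_cd(X_star, y_star, gamma)  # lasso on augmented data
beta_naive_enet <- beta_star / sqrt(1 + lambda2)    # undo the beta* rescaling

# Built-in route: glmnet mixes the penalties via alpha (1 = lasso, 0 = ridge).
library(glmnet)
fit <- glmnet(X, y, alpha = 0.5)
```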