8 Jan, 2016
Shrinkage & Regularization

OLS: Ordinary Least Squares. $\hat{\beta} = \text{argmin}_\beta \|y-X\beta\|^2 = (X'X)^{-1}X'y$
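
A minimal R sketch of the closed form (the simulated X, y, and coefficient values are illustrative):

  # Simulate a small design matrix and response (illustrative values)
  set.seed(1)
  n <- 100; p <- 3
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% c(1, -2, 0.5) + rnorm(n))
  # Closed-form OLS: (X'X)^{-1} X'y
  beta_hat <- solve(t(X) %*% X, t(X) %*% y)
  # Matches the coefficients from lm() without an intercept
  coef(lm(y ~ X - 1))
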
Ridge Regression:
  - minimize $\|y-X\beta\|^2$ s.t. $\|\beta\|^2 \le s$
 
  - minimize $\|y-X\beta\|^2 + \lambda\sum_j \beta_j^2$
 
  - $\tilde{\beta} = (X'X+\lambda I)^{-1}X'y$
 
  library(MASS)                                   # provides lm.ridge
  fit <- lm.ridge(y ~ X, lambda = lambda_vector)  # one ridge fit per value in lambda_vector
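
  A sketch of the closed form for $\tilde{\beta}$ above (note that lm.ridge standardizes the predictors internally, so its reported coefficients differ slightly from this raw formula):

  # Closed-form ridge for one lambda (no intercept; raw, unstandardized formula)
  ridge_beta <- function(X, y, lambda) {
    solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
  }
  ridge_beta(X, y, lambda = 1)
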
  - Choose $\lambda$ using cross-validation, based on the per-fold error $\text{MSE}_k = \frac{1}{n_k}\sum_{i \in \text{fold } k} (y_i-\hat{y}_i)^2$, where $n_k$ is the fold size (a code sketch follows this list)
    
      - $\text{MSE} = \frac{1}{K}\sum_{k=1}^{K} \text{MSE}_k$
 
      - k-fold if $n$ is large
 
      - leave-one-out if $n$ is small
 
      - choose $\lambda$ that minimizes MSE
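
  A sketch of the $K$-fold procedure, reusing the ridge_beta helper from above (the $\lambda$ grid is illustrative):

  # K-fold CV: average the per-fold MSE for each lambda, keep the minimizer
  K <- 10
  lambdas <- 10^seq(-2, 3, length.out = 50)
  folds <- sample(rep(1:K, length.out = nrow(X)))   # random fold assignment
  cv_mse <- sapply(lambdas, function(lam) {
    mse_k <- sapply(1:K, function(k) {
      train <- folds != k
      b <- ridge_beta(X[train, ], y[train], lam)
      mean((y[!train] - X[!train, ] %*% b)^2)       # MSE_k on the held-out fold
    })
    mean(mse_k)                                     # average over the K folds
  })
  best_lambda <- lambdas[which.min(cv_mse)]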
 
  - Note that we don't penalize the intercept: it is not a coefficient attached to any predictor, it just measures the overall mean level of the response.
 
  - It is typical to standardize the predictors so that the penalty is scale invariant (see the sketch after this list).
 
  - Disadvantage: coefficients are shrunk toward zero but never exactly to zero, so the estimates tend to be smaller in magnitude than the truth and the model does not select variables.
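
  A sketch of the standardization step, assuming the same X, y, and ridge_beta as above:

  # Center and scale predictors so the penalty treats all coefficients equally
  Xs <- scale(X)       # each column: mean 0, sd 1
  ys <- y - mean(y)    # centering y plays the role of the unpenalized intercept
  ridge_beta(Xs, ys, lambda = 1)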
 
Lasso: