29 Jan, 2016

Compressing Predictors


If prediction, rather than parameter estimation, is the main goal, we don’t want to use generalized Pareto priors.

Goal:

Build a predictive model of $y$ on $\underset{p \times 1}{\pmb x}$, where $p \approx 50000$. Compress the predictors with a random matrix $\underset{m\times p}{\Phi}$, $m \ll p$, and regress on the compressed predictors $\underset{m\times 1}{\Phi \pmb x}$ instead.
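The compression step can be sketched in a few lines. This is a minimal sketch with simulated data: the Gaussian entries and $1/\sqrt{m}$ scaling of $\Phi$ are illustrative assumptions (the entries' distribution is not specified in these notes), and dimensions are scaled down from $p \approx 50000$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100, 5000, 20              # m << p

X = rng.normal(size=(n, p))          # n observations of p predictors
Phi = rng.normal(size=(m, p)) / np.sqrt(m)  # random projection matrix

Z = X @ Phi.T                        # compressed design matrix, n x m
print(Z.shape)                       # (100, 20)
```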

Papers

  1. k-nearest neighbor clustering
  2. Image compression

Posterior

For the model $y = X\Phi’\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I)$, with prior $\beta\,|\,\sigma^2 \sim N(0, \sigma^2\Sigma_\beta)$:

  • $\Sigma=2b_1 / n$
  • $\mu = [\Phi X’X\Phi’ + \Sigma_\beta^{-1}]^{-1}\Phi X’y$
  • $a_1=n/2, b_1=[y’y - y’X\Phi’[\Phi X’X\Phi’ + \Sigma_\beta^{-1}]^{-1}\Phi X’y]/2$

  • $\mu_{pred} = (\Phi x_0)’\mu$
  • $\sigma^2_{pred} = 2\frac{b_1}{n}[ 1 + (\Phi x_0)’ [\Phi X’X\Phi’ + \Sigma_\beta^{-1}]^{-1} \Phi x_0]$
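The posterior and predictive moments above can be computed directly. A minimal sketch on simulated data, assuming $\Sigma_\beta = I$ for simplicity (so $\Sigma_\beta^{-1}$ is also the identity):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 100, 500, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
Phi = rng.normal(size=(m, p)) / np.sqrt(m)
Sigma_beta_inv = np.eye(m)                   # assumes Sigma_beta = I

Z = X @ Phi.T                                # compressed design, n x m
A = Z.T @ Z + Sigma_beta_inv                 # Phi X'X Phi' + Sigma_beta^{-1}
A_inv = np.linalg.inv(A)

mu = A_inv @ Z.T @ y                         # posterior mean of beta
a1 = n / 2
b1 = (y @ y - y @ Z @ A_inv @ Z.T @ y) / 2   # posterior scale

# predictive moments at a new point x0
x0 = rng.normal(size=p)
z0 = Phi @ x0
mu_pred = z0 @ mu
sigma2_pred = 2 * (b1 / n) * (1 + z0 @ A_inv @ z0)
```

Everything is a fixed linear-algebra computation on an $m \times m$ system, which is what makes the closed form cheap when $m \ll p$.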

Note that the posterior is obtained in closed form, which is not the case for many competing methods.

How to choose $m$?

  • model averaging
    • create model for each (specified) dimension of $\Phi$
    • $P(\mathcal M_l | D) = \frac{P(D|\mathcal M_l)P(\mathcal M_l)}{P(D)}$
    • $P(\mathcal M_l) = 1/S$, $S$ is the number of models (i.e. number of $m$’s).
    • $P(D|\mathcal M_l) = \int~ N(y|X\Phi’\beta,\sigma^2I)\pi(\beta|\sigma^2)\pi(\sigma^2) ~d\beta d\sigma^2$
      • $= \frac{\Gamma(n/2)\,2^{n/2}}{(2\pi)^{n/2}\,|X\Phi’\Sigma_\beta\Phi X’ + I|^{1/2}\,[y’(X\Phi’\Sigma_\beta\Phi X’ + I)^{-1}y]^{n/2}}$
    • Simulate a new $\Phi$ for each model
    • rather than picking a single model, we weight each model by its posterior probability and average the predictions from all of them
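The model-averaging steps above can be sketched as follows. This assumes $\Sigma_\beta = I$ and the improper prior $\pi(\sigma^2) \propto 1/\sigma^2$ (which yields the $\Gamma(n/2)\,2^{n/2}$ form of the marginal likelihood), with simulated data and an arbitrary grid of $m$ values:

```python
import numpy as np
from math import lgamma, log, pi

def log_marginal(y, Z):
    # log P(D | M) with Sigma_beta = I, so V = Z Z' + I
    n = len(y)
    V = Z @ Z.T + np.eye(n)
    _, logdet = np.linalg.slogdet(V)
    quad = y @ np.linalg.solve(V, y)           # y' V^{-1} y
    return (lgamma(n / 2) + (n / 2) * log(2) - (n / 2) * log(2 * pi)
            - 0.5 * logdet - (n / 2) * log(quad))

rng = np.random.default_rng(2)
n, p = 60, 300
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

ms = [5, 10, 20]                               # candidate compressed dimensions
logs = []
for m in ms:
    Phi = rng.normal(size=(m, p)) / np.sqrt(m) # simulate a new Phi per model
    logs.append(log_marginal(y, X @ Phi.T))

# posterior model probabilities under the uniform prior P(M_l) = 1/S
logs = np.array(logs)
weights = np.exp(logs - logs.max())
weights /= weights.sum()
```

Subtracting `logs.max()` before exponentiating is the usual log-sum-exp trick to avoid underflow when marginal likelihoods differ by many orders of magnitude.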

The model yields narrower credible intervals and lower MSE than lasso and ridge.