29 Jan, 2016
Compressing Predictors
If prediction, rather than parameter estimation, is the main goal, we don't want to use generalized Pareto priors.
Goal:
Build a predictive model of $y$ on $\underset{p \times 1}{\pmb x}$, with $p \approx 50000$.
Compress with a random matrix $\underset{m\times p}{\Phi}$, $m \ll p$: replace $\underset{p\times 1}{\pmb x}$ with the compressed predictor $\underset{m\times 1}{\Phi \pmb x}$.
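A minimal sketch of the compression step. The sizes and the Gaussian distribution for the entries of $\Phi$ are illustrative assumptions, not specified in the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, m = 100, 5000, 20          # illustrative sizes; the notes use p ~ 50000
X = rng.normal(size=(n, p))      # n observations of the p predictors

# Random projection matrix Phi (m x p); Gaussian entries are one common
# choice (the specific distribution for Phi is an assumption here).
Phi = rng.normal(size=(m, p)) / np.sqrt(m)

# Compressed predictors: each x (p x 1) becomes Phi x (m x 1), i.e. the
# n x p design matrix X becomes the n x m compressed design X Phi'.
X_c = X @ Phi.T
print(X_c.shape)
```

After this step the regression is fit on the $n \times m$ matrix $X\Phi'$ instead of the $n \times p$ matrix $X$.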
Papers
- k-nearest neighbor clustering
- Image compression
Posterior
Note that the posterior is obtained in closed form, unlike in other methods.
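A sketch of why the posterior is closed form: with a conjugate Gaussian prior on $\beta$ given the compressed design $Z = X\Phi'$, the posterior for $\beta$ is again Gaussian. Taking $\Sigma_\beta = I$ is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 20
Z = rng.normal(size=(n, m))               # compressed design X Phi'
beta_true = rng.normal(size=m)
y = Z @ beta_true + rng.normal(scale=0.5, size=n)

# Conjugate prior beta | sigma^2 ~ N(0, sigma^2 * Sigma_beta),
# with Sigma_beta = I here for illustration.
Sigma_beta_inv = np.eye(m)

# Closed-form Gaussian posterior for beta:
#   beta | y, sigma^2 ~ N(mu_n, sigma^2 * V_n),
#   V_n = (Z'Z + Sigma_beta^{-1})^{-1},  mu_n = V_n Z'y
V_n = np.linalg.inv(Z.T @ Z + Sigma_beta_inv)
mu_n = V_n @ Z.T @ y
```

No sampling or optimization loop is needed; the whole fit is the two linear-algebra lines at the end.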
How to choose $m$?
- model averaging
- create model for each (specified) dimension of $\Phi$
- $P(\mathcal M_l | D) = \frac{P(D|\mathcal M_l)P(\mathcal M_l)}{P(D)}$
- $P(\mathcal M_l) = 1/S$, $S$ is the number of models (i.e. number of $m$’s).
- $P(D|\mathcal M_l) = \int N(y \mid X\Phi'\beta, \sigma^2 I)\,\pi(\beta \mid \sigma^2)\,\pi(\sigma^2)\, d\beta\, d\sigma^2$
- $= \frac{1}{|X\Phi'\Sigma_\beta\Phi X' + I|^{1/2}} \, \frac{2^{n/2}\,\Gamma(n/2)}{\left(y'(X\Phi'\Sigma_\beta\Phi X'+I)^{-1}y\right)^{n/2}} \, \frac{1}{(2\pi)^{n/2}}$
- Simulate a new $\Phi$ for each model
- we don’t pick a single model; instead we weight the models by their posterior probabilities, predict with each model, and average the predictions
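The model-averaging recipe above can be sketched end to end. The helper below assumes $\Sigma_\beta = I$ and an improper prior $\pi(\sigma^2) \propto 1/\sigma^2$, which together yield the marginal-likelihood form in the notes; the grid of $m$ values and data sizes are made up for illustration:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 1000
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)   # toy signal in 5 predictors

def log_marginal(y, Z):
    """log P(D | M_l) for y ~ N(Z beta, sigma^2 I), beta|sigma^2 ~ N(0, sigma^2 I),
    pi(sigma^2) proportional to 1/sigma^2 (assumed priors)."""
    n = len(y)
    V = Z @ Z.T + np.eye(n)                      # X Phi' Sigma_beta Phi X' + I
    _, logdet = np.linalg.slogdet(V)
    quad = y @ np.linalg.solve(V, y)             # y' V^{-1} y
    return (-0.5 * logdet + (n / 2) * math.log(2) + math.lgamma(n / 2)
            - (n / 2) * math.log(quad) - (n / 2) * math.log(2 * math.pi))

# One model per compression dimension m, each with its own simulated Phi;
# uniform model prior P(M_l) = 1/S over the S models.
ms = [5, 10, 20, 40]
logps = []
for m in ms:
    Phi = rng.normal(size=(m, p)) / math.sqrt(m)
    logps.append(log_marginal(y, X @ Phi.T))

# Posterior model weights (normalized with the log-sum-exp trick);
# predictions would then average the S model-specific predictions.
logps = np.array(logps)
w = np.exp(logps - logps.max())
w /= w.sum()
print(dict(zip(ms, np.round(w, 3))))
```

With a uniform model prior, the weights are just the normalized marginal likelihoods, so no $P(D)$ needs to be computed explicitly.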
The model has narrower credible intervals and lower MSE than lasso and ridge.