10 Feb, 2016

Sequential Monte Carlo (fixed parameter settings)


  • State of the art
  • Chopin (2002)

ISIS Algorithm for SMC

  1. Suppose each batch contains $p$ observations (i.e. the first batch = \(\brac{y_1,...,y_p}\), the second batch = \(\brac{y_{p+1},...,y_{2p}}\), etc.)
  2. Draw $s$ samples from $\pi_{init}$. $\pi_{init}$ can be either the prior distribution or the posterior from the first batch of data.
  3. Let these samples be \(\theta_1,...,\theta_s\) and propagate them forward.
  4. At time $t$ you have samples \(\theta_1^t,...,\theta_s^t\)
  5. Observe the $(t+1)^{th}$ batch of data
  6. \[w_j \propto \frac{\pi(\theta_j^t|D_1,...,D_{t+1})}{\pi(\theta_j^t|D_1,...,D_{t})} \propto f(D_{t+1}|\theta_j^t)\]
  7. Now look at the discrete distribution \(\begin{matrix} & \theta_1^t,...,\theta_s^t \\ \text{w.p.} & w_1,...,w_s \end{matrix}\)
  8. We will resample $s$ samples (with replacement) from this distribution, obtaining \((h_1,...,h_s)\). This resampling step removes unimportant particles, but it can significantly reduce the number of distinct particles.
  9. Particle rejuvenation step:
    • Sample \(\theta_j^{t+1} \sim \kappa(\cdot,h_j)\), where $\kappa$ is a transition kernel whose stationary distribution is $\pi_{t+1}$.
    • Their proposal distribution for the Metropolis step is \(N(\mu,\Sigma)\), where \(\mu=\frac{1}{s}\sum_{j=1}^s \theta_j^t\) and \(\Sigma=\frac{1}{s}\sum_{j=1}^s (\theta_j^t-\mu)'(\theta_j^t-\mu)\). (These moves work for dimension $p \lt 20$.)
    • When $t$ gets big, the particles $\theta^{t+1}$ hardly change over time, and the entire system becomes degenerate quickly.
    • To avoid this, start with a large number of particles (i.e. $s$ is large), but then the storage overhead is large.
    • In the particle rejuvenation step we carry out one MCMC step. This requires the likelihood of \(\brac{D_1,...,D_t}\), which means we need to store the entire data set until time $(t+1)$. (A minimal sketch of one batch update follows this list.)
    • These issues have since been resolved, but they were early issues in the development of the algorithm.
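To make the steps above concrete, here is a minimal sketch of one batch update (reweight, resample, rejuvenate). The toy model ($y \sim N(\theta,1)$ with prior $N(0,10)$), the function names, and the choice of an independence-Metropolis move built from the particle cloud are assumptions made for illustration, not the exact implementation in Chopin (2002).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption for illustration): y ~ N(theta, 1), prior theta ~ N(0, 10).
def log_lik(theta, data):
    return -0.5 * np.sum((data[None, :] - theta[:, None]) ** 2, axis=1)

def log_prior(theta):
    return -0.5 * theta ** 2 / 10.0

def isis_step(theta, seen, new_batch):
    """One batch update: reweight, resample, rejuvenate (one MH step per particle)."""
    s = len(theta)
    # Step 6: incremental weights proportional to f(D_{t+1} | theta_j^t).
    logw = log_lik(theta, new_batch)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Steps 7-8: resample s particles (with replacement) from the weighted discrete distribution.
    theta = theta[rng.choice(s, size=s, p=w)]
    # Step 9: rejuvenation via one independence-Metropolis step targeting pi(theta | D_1..D_{t+1}),
    # with proposal N(mu, Sigma) built from the current particle cloud (as in the notes).
    data = np.concatenate([seen, new_batch])       # the full data set seen so far must be stored
    mu, sd = theta.mean(), theta.std() + 1e-9
    prop = rng.normal(mu, sd, size=s)
    log_target = lambda x: log_prior(x) + log_lik(x, data)
    log_q = lambda x: -0.5 * ((x - mu) / sd) ** 2
    log_alpha = (log_target(prop) - log_q(prop)) - (log_target(theta) - log_q(theta))
    accept = np.log(rng.uniform(size=s)) < log_alpha
    theta[accept] = prop[accept]
    return theta, data

# Usage: s particles, batches of p observations each, pi_init = prior.
s, p = 2000, 20
batches = rng.normal(1.5, 1.0, size=(5, p))        # synthetic data with true theta = 1.5
theta = rng.normal(0.0, np.sqrt(10.0), size=s)
seen = np.empty(0)
for batch in batches:
    theta, seen = isis_step(theta, seen, batch)
print(theta.mean(), theta.std())                   # approximate posterior mean and spread
```

Note that the rejuvenation step needs the full data set seen so far, which is exactly the storage issue mentioned above.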

Other Issues

  • Data order matters
  • How do we choose $p$?
    • If the model is more complex, then $p$ needs to be larger. That is, we need to feed the model more data at each time point.

Assumed Density Filtering

The posterior is complex, so we approximate it at every time point with a density from a suitable class of distributions. (This is a generalization of the Laplace approximation, done sequentially.)

Let \(D_1,...,D_t\) be the data batches. Assume a class of distributions used to approximate the posterior, \(Q = \brac{q_\eta: \eta\in\mathcal{E}}\). For example, \(Q=\brac{N(\mu,\Sigma): \mu\in \mathbb{R}^p, \Sigma \text{ a } p\times p \text{ covariance matrix}}\), so \(\Sigma\) has \(p(p+1)/2\) free parameters.

  • Start with $D_1$. $\pi(\theta | D_1)$ is not in a closed form that we can sample from.
  • $\eta^* = \underset{\eta}{\arg\min}\; KL\big(\pi(\theta | D_1) \,\|\, q_\eta(\theta)\big)$, giving $q_{\eta^*}(\theta)$
    • $q_{\eta^*}(\theta)$ is an approximation for the posterior from time 1.
  • Use this as the prior of $\theta$ at time 2.
\[\pi^{ps}(\theta|D_1,D_2) = \frac{q_{\eta^*}(\theta)f(D_2|\theta)}{\int~q_{\eta^*}(\theta)f(D_2|\theta)~d\theta}\]
  • The approximate posterior after time 2 is $q_{\eta^{**}}(\theta)$, where $\eta^{**} = \underset{\eta}{\arg\min}\; KL\big(\pi^{ps}(\theta | D_1, D_2) \,\|\, q_\eta(\theta)\big)$ (see the sketch below).
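For a Gaussian family $Q$, minimizing $KL(\pi \,\|\, q_\eta)$ over $\eta$ amounts to matching the mean and covariance of $\pi$ (moment matching). Below is a minimal sketch of the ADF recursion built on that fact; the toy likelihood ($y \sim N(\theta,1)$, one-dimensional) and the use of self-normalized importance sampling with $q_{\eta^*}$ as the proposal to approximate the moments of $\pi^{ps}$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy likelihood (an assumption for illustration): y ~ N(theta, 1).
def log_lik(theta, batch):
    return -0.5 * np.sum((batch[None, :] - theta[:, None]) ** 2, axis=1)

def adf_update(mu, var, batch, n=100_000):
    """One ADF step for Q = {N(mu, var)}: moment-match the tilted distribution
    q_eta*(theta) f(D_t | theta). The moments are approximated here by
    self-normalized importance sampling with q_eta* as the proposal
    (a numerical choice made for this sketch)."""
    theta = rng.normal(mu, np.sqrt(var), size=n)   # draws from the current approximation
    logw = log_lik(theta, batch)                   # weights proportional to f(D_t | theta)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Minimizing KL(pi^ps || q_eta) over a Gaussian family = matching mean and variance.
    new_mu = np.sum(w * theta)
    new_var = np.sum(w * (theta - new_mu) ** 2)
    return new_mu, new_var

# Usage: start from the prior N(0, 10) and absorb the batches one at a time.
mu, var = 0.0, 10.0
batches = rng.normal(1.5, 1.0, size=(5, 20))       # synthetic data with true theta = 1.5
for batch in batches:
    mu, var = adf_update(mu, var, batch)
print(mu, var)                                     # approximate posterior mean and variance
```

In contrast to the rejuvenation step of the SMC algorithm above, only the pair $(\mu, \sigma^2)$ is carried between batches, so no past data needs to be stored.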