9 Mar, 2016
Maximize $\log \pi (\theta\v x)$ wrt $\theta$.
\[\begin{aligned} \Delta\theta &= \theta_{t+1} - \theta_t \\ &= \epsilon\bk{\nabla \log p(\theta_t) + \sum_{i=1}^n \nabla\log p(x_i\v\theta_t)} \end{aligned}\]At every iteration, take a subsample of size $N \ll n$: \(\bc{x_{s_1},...,x_{s_N}}\)
\[\frac{1}{n} \sum_{i=1}^n \nabla \log p(x_i\v\theta) \approx \frac{1}{N} \sum_{j=1}^N \nabla \log p(x_{s_j}\v\theta)\] \[\theta_{t+1} = \theta_t + \xi_t\]where \(\xi_t = \epsilon_t\bk{\nabla \log p(\theta_t) + \frac{n}{N}\sum_{j=1}^N \nabla\log p(x_{s_j}\v\theta_t)}\), and $\epsilon_t$ is chosen such that
\[\begin{cases} \sum \epsilon_t^2 \lt \infty \\ \sum \epsilon_t = \infty \\ \end{cases}\]One such choice is $\epsilon_t = a(b+t)^{-\gamma}$ with $0.5 \lt \gamma \le 1$.
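As a concrete sketch of the minibatch update with this decaying step size (the toy model and all parameter values here are my own illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model (not from the notes): x_i ~ N(theta, 1) with a
# N(0, 10) prior on theta; the posterior mode is close to the data mean.
n = 10_000
x = rng.normal(2.0, 1.0, size=n)

N = 100                              # minibatch size, N << n
a, b, gamma = 1e-5, 10.0, 0.55       # illustrative schedule parameters
theta = 0.0
for t in range(2_000):
    eps_t = a * (b + t) ** -gamma    # sum eps_t = inf, sum eps_t^2 < inf
    batch = x[rng.integers(0, n, size=N)]
    # grad log prior + (n/N) * grad log likelihood on the minibatch
    grad = -theta / 10.0 + (n / N) * np.sum(batch - theta)
    theta += eps_t * grad            # the step xi_t from above
print(theta)  # should settle near the data mean (~2.0)
```

The $n/N$ factor rescales the minibatch gradient so it is an unbiased estimate of the full-data gradient.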
The update above is plain stochastic gradient descent (SGD). Injecting Gaussian noise at each step turns it into a sampler (stochastic gradient Langevin dynamics): \[\theta_{t+1} = \theta_t + \xi_t + \eta_t\]where $\eta_t\sim N(0,\epsilon_t)$.
Welling and Teh proposed using the above update as a Metropolis–Hastings proposal; since $\epsilon_t \rightarrow 0$, the acceptance rate goes to 1 as $t \rightarrow \infty$, so the accept/reject step can eventually be skipped.
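A minimal SGLD sketch under the same assumed toy Gaussian-mean model (all constants illustrative; note the published algorithm puts a factor of $1/2$ on the gradient term, while this follows the $\xi_t + \eta_t$ form above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model (not from the notes): x_i ~ N(theta, 1), prior N(0, 10).
n = 10_000
x = rng.normal(2.0, 1.0, size=n)

N, a, b, gamma = 100, 1e-5, 10.0, 0.55
theta, samples = 0.0, []
for t in range(5_000):
    eps_t = a * (b + t) ** -gamma
    batch = x[rng.integers(0, n, size=N)]
    grad = -theta / 10.0 + (n / N) * np.sum(batch - theta)
    # xi_t (stochastic gradient step) plus eta_t ~ N(0, eps_t) (Langevin noise)
    theta += eps_t * grad + rng.normal(0.0, np.sqrt(eps_t))
    samples.append(theta)

# Discard burn-in; the remaining iterates approximate posterior draws.
posterior_mean_est = np.mean(samples[2_000:])
print(posterior_mean_est)  # should be near the data mean (~2.0)
```

Because the noise variance matches the step size, the late iterates wander with the posterior's scale rather than collapsing to the mode as plain SGD would.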