9 Mar, 2016
Maximize $\log \pi (\theta\v x)$ wrt $\theta$.
\[\begin{aligned} \Delta\theta &= \theta_{t+1} - \theta_t \\ &= \epsilon\bk{\nabla \log p(\theta_t) + \sum_{i=1}^n \nabla\log p(x_i\v\theta_t)} \end{aligned}\]At every iteration, take a subsample of size $N \ll n$: \(\bc{x_{s_1},...,x_{s_N}}\)
\[\frac{1}{n} \sum_{i=1}^n \nabla \log p(x_i\v\theta) \approx \frac{1}{N} \sum_{j=1}^N \nabla \log p(x_{s_j}\v\theta)\] \[\theta_{t+1} = \theta_t + \xi_t\]where \(\xi_t = \epsilon_t\bk{\nabla \log p(\theta_t) + \frac{n}{N}\sum_{j=1}^N \nabla\log p(x_{s_j}\v\theta_t)}\), and $\epsilon_t$ is chosen such that
\[\begin{cases} \sum \epsilon_t^2 \lt \infty \\ \sum \epsilon_t = \infty \\ \end{cases}\]One such choice is $\epsilon_t = a(b+t)^{-\gamma}$ with $0.5 \lt \gamma \le 1$.
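As a concrete sketch of the minibatch update with this decaying step size (the toy model and all parameter values here are my own illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model (not from the notes): x_i ~ N(theta, 1) with a
# N(0, 10) prior on theta; the posterior mode is close to the data mean.
n = 10_000
x = rng.normal(2.0, 1.0, size=n)

N = 100                              # minibatch size, N << n
a, b, gamma = 1e-5, 10.0, 0.55       # illustrative schedule parameters
theta = 0.0
for t in range(2_000):
    eps_t = a * (b + t) ** -gamma    # sum eps_t = inf, sum eps_t^2 < inf
    batch = x[rng.integers(0, n, size=N)]
    # grad log prior + (n/N) * grad log likelihood on the minibatch
    grad = -theta / 10.0 + (n / N) * np.sum(batch - theta)
    theta += eps_t * grad            # the step xi_t from above
print(theta)  # should settle near the data mean (~2.0)
```

The $n/N$ factor rescales the minibatch gradient so it is an unbiased estimate of the full-data gradient.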
The update above is plain stochastic gradient descent (SGD). Injecting Gaussian noise at each step turns it into a sampler (stochastic gradient Langevin dynamics): \[\theta_{t+1} = \theta_t + \xi_t + \eta_t\]where $\eta_t\sim N(0,\epsilon_t)$.
Welling and Teh proposed using the above update as a Metropolis–Hastings proposal; since $\epsilon_t \rightarrow 0$, the acceptance rate goes to 1 as $t \rightarrow \infty$, so the accept/reject step can eventually be skipped.
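A minimal SGLD sketch under the same assumed toy Gaussian-mean model (all constants illustrative; note the published algorithm puts a factor of $1/2$ on the gradient term, while this follows the $\xi_t + \eta_t$ form above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model (not from the notes): x_i ~ N(theta, 1), prior N(0, 10).
n = 10_000
x = rng.normal(2.0, 1.0, size=n)

N, a, b, gamma = 100, 1e-5, 10.0, 0.55
theta, samples = 0.0, []
for t in range(5_000):
    eps_t = a * (b + t) ** -gamma
    batch = x[rng.integers(0, n, size=N)]
    grad = -theta / 10.0 + (n / N) * np.sum(batch - theta)
    # xi_t (stochastic gradient step) plus eta_t ~ N(0, eps_t) (Langevin noise)
    theta += eps_t * grad + rng.normal(0.0, np.sqrt(eps_t))
    samples.append(theta)

# Discard burn-in; the remaining iterates approximate posterior draws.
posterior_mean_est = np.mean(samples[2_000:])
print(posterior_mean_est)  # should be near the data mean (~2.0)
```

Because the noise variance matches the step size, the late iterates wander with the posterior's scale rather than collapsing to the mode as plain SGD would.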