Introduction to Automatic Differentiation Variational Inference

This page was last updated on 17 Mar, 2021.


In this post, I will give a brief overview of Automatic Differentiation Variational Inference (ADVI). Basic knowledge of Bayesian inference is assumed. Familiarity with gradient-based optimization methods like gradient descent will be helpful. I have based most of this post on this paper. This post is not meant to be comprehensive, but simply to provide a high-level overview for those interested in implementing ADVI. Those seeking a more in-depth treatment should review the papers linked in this post.

An example implementation of linear regression with ADVI, built from scratch in PyTorch, is provided here.

Bayesian Inference

Practitioners of Bayesian methodology are typically interested in posterior distributions of model parameters. That is, given a model $p(\data \mid \theta)$, where $\theta$ denotes the model parameters, one may be interested in $p(\theta \mid \data)$. Using Bayes' rule, we know that

\[p(\theta \mid \data) = \frac{ p(\theta)\, p(\data\mid\theta) }{\int_\Theta p(\theta)\, p(\data\mid\theta)\, d\theta}\]

where $\Theta$ is the parameter space of $\theta$. Typically, the integral in the denominator of this expression is intractable (it cannot be obtained in closed form). One way to characterize the posterior distribution is to sample from it via Markov chain Monte Carlo (MCMC). These methods can be computationally expensive and inefficient: because consecutive draws from a Markov chain are auto-correlated, obtaining the equivalent of $B$ roughly independent samples may require drawing $B^* > B$ samples and then keeping only a subset of $B$ draws that are less auto-correlated, a procedure called thinning. MCMC can provide excellent solutions to complex models, though at times it can be prohibitively slow.
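
As a concrete illustration, here is a minimal sketch (in PyTorch, the library used in the linked example) of the numerator of Bayes' rule for a toy model with a Normal likelihood and a Normal prior on its mean. The function name `log_joint` and the toy data are my own choices, not part of any particular library or the linked code. The numerator is easy to evaluate pointwise; it is the normalizing integral in the denominator that is typically intractable, and this unnormalized quantity is what both MCMC and VI work with.

```python
import torch
import torch.distributions as dist

# Hypothetical toy model: Normal likelihood with known scale 1, Normal(0, 10) prior on the mean.
# The numerator of Bayes' rule (prior times likelihood) is easy to evaluate pointwise;
# it is the normalizing integral in the denominator that is typically intractable.
def log_joint(theta, data):
    log_prior = dist.Normal(0.0, 10.0).log_prob(theta)
    log_lik = dist.Normal(theta, 1.0).log_prob(data).sum()
    return log_prior + log_lik  # log of the unnormalized posterior

data = torch.randn(100) + 3.0   # toy data with true mean 3
print(log_joint(torch.tensor(2.5), data))
```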

Variational Inference (VI)

Variational inference is an alternative to MCMC for fitting Bayesian models. Practitioners of this method are still interested in the posterior distribution of model parameters, but typically seek methods that are faster and that scale to datasets with a large number of observations.

In VI, one seeks a tractable (analytic) distribution that is a "close" approximation to the true posterior distribution. Formally, one seeks the solution to the following expression:

\[\argmin{\phi \in \Phi} \KL\p{q(\theta; \phi) \abs{} p(\theta \mid \data)}.\]

In the expression above, $q(\theta; \phi)$ is known as the variational density. It is a parametric density, parameterized by $\phi$, which should be easy to sample from and evaluate. The Kullback-Leibler (KL) divergence measures the "distance" between two densities (it is not a true metric, as it is not symmetric), so we are trying to minimize the dissimilarity between the approximating distribution and the true posterior. The exact form of the KL divergence between two densities is:

\[\KL\p{q(\theta) \abs{} p(\theta)} = \E_q\bk{\log q(\theta) - \log p(\theta)}\]

where the expectation is taken with respect to $q$, which in VI is the variational distribution for $\theta$.
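
As a quick illustration of this formula, the following sketch estimates the KL divergence between two hypothetical univariate Normal densities by Monte Carlo, averaging $\log q - \log p$ over draws from $q$. The particular densities are placeholders of my own choosing.

```python
import torch
import torch.distributions as dist

# Monte Carlo estimate of KL(q || p) = E_q[log q(theta) - log p(theta)],
# using two hypothetical univariate Normal densities as placeholders.
q = dist.Normal(loc=0.5, scale=1.0)   # stand-in variational density
p = dist.Normal(loc=0.0, scale=2.0)   # stand-in target density

theta = q.sample((10_000,))           # draws from q
kl_mc = (q.log_prob(theta) - p.log_prob(theta)).mean()

# For two Normals the KL divergence is available in closed form, so we can sanity-check.
print(kl_mc.item(), dist.kl_divergence(q, p).item())
```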

Directly minimizing the KL divergence is difficult, as it involves the intractable posterior. Consequently, practitioners instead maximize the evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence (the two objectives differ only by $\log p(\data)$, which does not depend on $\phi$). The ELBO is:

\[\begin{aligned} \elbo(\phi) &= \E_{q(\theta;\phi)}\bk{\log p(\data\mid\theta) + \log p(\theta) - \log q(\theta; \phi)} \\ &= \E_{q(\theta;\phi)}\bk{\log p(\data, \theta) - \log q(\theta; \phi)}. \end{aligned}\]

Therefore, the objective in VI becomes finding solutions to:

\[\phi^* = \argmax{\phi} \elbo(\phi).\]

Here, I will leave the reader to explore on their own how this is classically done. This review outlines the coordinate ascent variational inference (CAVI) algorithm for finding $\phi^*$. Suffice it to say that in CAVI, one updates the parameters $\phi$ of the variational distribution sequentially, until some convergence criterion is met. This requires analytic derivations of the updates, which can be time-consuming at best and infeasible in some cases. For instance, closed-form CAVI updates are not available for logistic regression.

Below are some items to be cautious about:

  • Local modes
    • Like other iterative methods that aim for a global maximum, CAVI and other VI implementations may converge to one of many local modes instead.
  • Selecting $q(\theta)$
    • The most important aspect of selecting $q(\theta)$ is ensuring the support of $q(\theta)$ matches that of the prior $p(\theta)$.
  • Assessing convergence (to a local mode)
    • One can assess convergence by checking whether the current evaluation of the ELBO has changed from the previous one by less than some threshold (a minimal sketch of this check follows the list).
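
Below is a minimal sketch of the ELBO-based stopping rule from the last bullet. Here `estimate_elbo` is a hypothetical placeholder for a (possibly noisy) Monte Carlo ELBO estimate; in practice one often monitors a running average of the ELBO, since single-sample estimates fluctuate.

```python
import random

# Hypothetical placeholder for a (noisy) Monte Carlo ELBO estimate.
def estimate_elbo():
    return -100.0 + random.gauss(0.0, 1e-5)

tol, elbo_old = 1e-4, float("-inf")
for iteration in range(10_000):
    elbo_new = estimate_elbo()
    if abs(elbo_new - elbo_old) < tol:   # change smaller than the threshold: stop
        break
    elbo_old = elbo_new
print(f"stopped after {iteration + 1} iterations")
```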

ADVI

As previously mentioned, CAVI is one way to maximize the ELBO. However, deriving the updates for CAVI can be difficult or impossible even in common cases. An alternative way to maximize the ELBO is automatic differentiation variational inference (ADVI), a gradient-based method. One still uses an iterative optimization procedure to obtain $\phi^*$, but instead of CAVI, one uses something like (stochastic) gradient descent. This requires computing derivatives of the ELBO with respect to the variational parameters, which can be very tedious in complex models. With automatic differentiation (AD) libraries (like autograd and PyTorch in Python), one can rely on the library to compute derivatives accurately. How these libraries compute derivatives is out of the scope of this post. Just note that AD libraries like the one in PyTorch do not compute derivatives numerically (via finite differences), nor symbolically. Forward-mode AD can be described through a representation of variables known as dual numbers, while reverse-mode implementations record a computational graph of the program (and some systems use source-code transformation) to compute gradients efficiently. This paper surveys how AD is used in machine learning.
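
As a small illustration of what such a library provides, the following PyTorch snippet computes the gradient of a scalar function of `phi` via reverse-mode AD; the function itself is an arbitrary example of my own.

```python
import torch

# PyTorch computes this gradient by reverse-mode AD over the recorded computation
# graph: exact up to floating point, with no symbolic algebra or finite differences.
phi = torch.tensor([1.0, -2.0], requires_grad=True)
f = (phi[0] ** 2) * torch.exp(phi[1])   # an arbitrary differentiable function of phi
f.backward()                            # populates phi.grad
print(phi.grad)                         # [2*phi0*exp(phi1), phi0^2*exp(phi1)]
```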

We will first assume all model parameters are continuous. In ADVI, the ELBO is first re-written as

\[\elbo(\phi) = \E_{q(\zeta;\phi)}\bk{\log p\p{\data, T^{-1}(\zeta)} + \log \abs{\det J_{T^{-1}}(\zeta)} - \log q(\zeta; \phi)}.\]

Here, $T$ is a function that transforms $\theta$ to $\zeta$, where $\zeta \in \mathbb{R}^{\operatorname{dim}(\theta)}$. That is, $T: \operatorname{support}(\theta) \rightarrow \mathbb{R}^{\operatorname{dim}(\theta)}$. Hence, the log absolute value of the determinant of the Jacobian of $T^{-1}$ must be added to the log joint density of the data and model parameters to account for the change of variables. Note that this ELBO expression is equivalent to the previous one. Because the transformed model parameters $\zeta$ have support on the real line, a suitable variational distribution for $\zeta$ is a (multivariate) Normal distribution (which has the same support).
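
As a concrete instance of the transformation, suppose $\theta$ is a standard deviation constrained to be positive and take $T = \log$, so that $\zeta = \log\theta$, $T^{-1}(\zeta) = e^\zeta$, and $\log |\det J_{T^{-1}}(\zeta)| = \zeta$. The sketch below (with a hypothetical HalfNormal prior and toy data of my own choosing) adds this Jacobian term to the log joint evaluated on the unconstrained scale.

```python
import torch
import torch.distributions as dist

# Hypothetical example: theta is a positive scale parameter with a HalfNormal(1) prior,
# the data are Normal(0, theta), and T = log so that zeta = log(theta) is unconstrained.
def log_joint_unconstrained(zeta, data):
    theta = torch.exp(zeta)                    # T^{-1}(zeta): back to the constrained space
    log_prior = dist.HalfNormal(1.0).log_prob(theta)
    log_lik = dist.Normal(0.0, theta).log_prob(data).sum()
    log_abs_det_jacobian = zeta                # d/dzeta exp(zeta) = exp(zeta), so log|det J| = zeta
    return log_prior + log_lik + log_abs_det_jacobian

data = 2.0 * torch.randn(50)                   # toy data with true scale 2
print(log_joint_unconstrained(torch.tensor(0.5), data))
```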

Having a multivariate Normal as the variational distribution allows us to estimate the expectation in the ELBO, and its gradient, by Monte Carlo. Specifically, to estimate the ELBO, one can sample values from the variational (Normal) distribution and evaluate the expression inside the expectation above. (The authors of the ADVI paper argue that in practice, one sample is sufficient.)

To maximize the ELBO, the gradient of the ELBO with respect to the variational parameters is required. That is

\[\nabla_\phi\elbo(\phi) = \nabla_\phi \E_{q(\zeta;\phi)}\bk{\log p\p{\data, T^{-1}(\zeta)} + \log \abs{\det J_{T^{-1}}(\zeta)} - \log q(\zeta; \phi)}.\]

Again, we can approximate this using Monte Carlo integration. However, differentiating through a random draw whose distribution depends on $\phi$ is not straightforward. So we instead draw a standard Normal variate, multiply it by the variational standard deviation, and add the variational mean (the "reparameterization trick"). This lets us push the gradient inside the expectation. More explicitly,

\[\nabla_\phi\elbo(\phi) \approx \nabla_\phi \bk{\log p\p{\data, T^{-1}(\tilde\zeta)} + \log \abs{\det J_{T^{-1}}(\tilde\zeta)} - \log q(\tilde\zeta; \phi)}\]

where $\tilde{\zeta} = \mu + z \sigma$, $z$ is a draw from a standard Normal, and $(\mu, \sigma)$ are the variational mean and standard deviation (these may be vectors, in which case $z$ is a standard multivariate Normal draw and the product is element-wise).
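
Here is a minimal, self-contained sketch of that reparameterized gradient in PyTorch. The `log_joint` function is a stand-in for $\log p(\data, T^{-1}(\zeta)) + \log |\det J_{T^{-1}}(\zeta)|$; any differentiable function of $\zeta$ works for this illustration.

```python
import torch

# `log_joint` stands in for log p(data, T^{-1}(zeta)) + log|det J|; any differentiable
# function of zeta works for this illustration.
def log_joint(zeta):
    return -0.5 * (zeta - 2.0) ** 2

mu = torch.tensor(0.0, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

z = torch.randn(())                            # z ~ N(0, 1); no variational parameter inside
zeta = mu + sigma * z                          # reparameterized draw: zeta_tilde = mu + sigma * z
log_q = torch.distributions.Normal(mu, sigma).log_prob(zeta)
elbo_hat = log_joint(zeta) - log_q             # single-sample ELBO estimate

elbo_hat.backward()                            # gradient flows through zeta to (mu, sigma)
print(mu.grad, sigma.grad)
```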

What remains is performing some kind of gradient ascent (or descent on the negative ELBO) to obtain a solution for $\phi$. Note that the variational parameters themselves should also be optimized on the unconstrained space; that is, the variational standard deviations should be modeled on the log scale. No additional Jacobian term is needed here, since this is a reparameterization of the optimization variables rather than a change of variables inside the expectation. Also note that this mean-field version of ADVI does not capture correlations between parameters. However, by using a full-rank multivariate Normal as the variational distribution, correlations between model parameters can be modeled at an additional computational cost.
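
Putting the pieces together, here is a self-contained sketch of a mean-field ADVI loop for a toy linear regression. This is my own illustration, not the linked implementation: the priors, data, learning rate, and number of steps are arbitrary choices, and Adam is used simply as a convenient off-the-shelf optimizer. The variational means and log standard deviations are the free parameters, and each step descends the negative of a single-sample ELBO estimate.

```python
import torch
import torch.distributions as dist

# Toy data: y = X w + noise, with assumed priors w ~ Normal(0, 1), noise scale s ~ HalfNormal(1).
torch.manual_seed(0)
N, P = 200, 3
X = torch.randn(N, P)
true_w = torch.tensor([1.5, -2.0, 0.5])
y = X @ true_w + 0.3 * torch.randn(N)

D = P + 1                                          # P weights + 1 (log) noise scale
vp_mu = torch.zeros(D, requires_grad=True)         # variational means
vp_log_sigma = torch.zeros(D, requires_grad=True)  # variational log std devs (unconstrained)
opt = torch.optim.Adam([vp_mu, vp_log_sigma], lr=0.05)

def log_joint(zeta):
    w, log_s = zeta[:P], zeta[P]
    s = torch.exp(log_s)                            # T^{-1} for the positive noise scale
    lp = dist.Normal(0.0, 1.0).log_prob(w).sum()            # prior on weights
    lp = lp + dist.HalfNormal(1.0).log_prob(s) + log_s      # prior on s plus log|det J|
    lp = lp + dist.Normal(X @ w, s).log_prob(y).sum()       # likelihood
    return lp

for step in range(2000):
    opt.zero_grad()
    sigma = torch.exp(vp_log_sigma)
    z = torch.randn(D)                              # one standard Normal draw
    zeta = vp_mu + sigma * z                        # reparameterized draw from q
    log_q = dist.Normal(vp_mu, sigma).log_prob(zeta).sum()
    elbo = log_joint(zeta) - log_q
    (-elbo).backward()                              # minimize the negative ELBO
    opt.step()

print("variational posterior mean of w:", vp_mu[:P].detach())
print("true w:                         ", true_w)
```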


From here, I leave the reader to review the ADVI paper for the details of the methodology. Check out an example implementation of linear regression with ADVI.

References