Hierarchical Prior for Variances

Consider

\[\begin{aligned} \nu_n \mid \gamma &\sim \InverseGamma(\alpha + 1, \alpha \gamma), \text{ for } n \in \bc{1,\dots, N} \\ \gamma &\sim \GammaDist(a, \text{rate}=b) \end{aligned}\]
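To make the generative model concrete, here is a minimal sketch of forward-sampling from this prior in Python (NumPy/SciPy); the hyperparameter values are arbitrary placeholders, not recommendations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical hyperparameters, chosen only for illustration.
a, b = 2.0, 1.0   # gamma ~ Gamma(shape=a, rate=b)
alpha = 5.0       # controls how tightly nu_n concentrates around gamma
N = 10

# NumPy's gamma is parameterized by scale, so scale = 1/rate.
gamma = rng.gamma(shape=a, scale=1.0 / b)

# nu_n | gamma ~ InverseGamma(shape=alpha + 1, scale=alpha * gamma).
nu = stats.invgamma.rvs(a=alpha + 1, scale=alpha * gamma, size=N, random_state=rng)
```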

Note that $\gamma \mid \bm\nu \sim \GammaDist(a + N (\alpha + 1), \text{rate} = b + \alpha \sum_{n=1}^N 1 / \nu_n)$.
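This conjugate full conditional makes a Gibbs update for $\gamma$ a one-liner; a sketch, continuing the setup above:

```python
def update_gamma(nu, a, b, alpha, rng):
    """Draw from gamma | nu ~ Gamma(a + N(alpha + 1), rate=b + alpha * sum(1/nu))."""
    shape = a + len(nu) * (alpha + 1)
    rate = b + alpha * np.sum(1.0 / nu)
    return rng.gamma(shape=shape, scale=1.0 / rate)

gamma = update_gamma(nu, a, b, alpha, rng)
```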

Moreover,

\[\begin{aligned} \E\bk{\nu_n \mid \gamma} &= \gamma \\ \Var\bk{\nu_n \mid \gamma} &= \gamma ^ 2 / (\alpha - 1) \\ \E\bk{\nu_n} &= a / b \\ \Var\bk{\nu_n} &= a (a + \alpha) / (b^2 (\alpha - 1)), \end{aligned}\]

where the variance expressions require $\alpha > 1$.
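These marginal moments are easy to verify by Monte Carlo; a quick sanity check under the same assumed hyperparameters:

```python
# Marginalize over gamma by simulation and compare to the closed forms.
M = 1_000_000
g = rng.gamma(shape=a, scale=1.0 / b, size=M)
nu_draws = stats.invgamma.rvs(a=alpha + 1, scale=alpha * g, random_state=rng)

print(nu_draws.mean(), a / b)                                  # E[nu_n]
print(nu_draws.var(), a * (a + alpha) / (b**2 * (alpha - 1)))  # Var[nu_n]
```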

Note that setting $\alpha$ to be large encourages $\nu_n$ to be close to $\gamma$, since $\Var\bk{\nu_n \mid \gamma} = \gamma^2 / (\alpha - 1)$ shrinks toward zero as $\alpha$ grows. In addition,

  1. A diffuse prior for $\gamma$ with large $\alpha$ conveys that $\gamma$ should be learned from the data, but that the $\nu_n$ should be close to each other (and to $\gamma$). This promotes borrowing of strength. If $\alpha$ is too large, you might as well replace the model with a single $\nu \sim \InverseGamma(a, b)$, i.e., remove the index $n$.
  2. An informed (concentrated) prior for $\gamma$ with small $\alpha$ conveys that the $\nu_n$ could be far from the global variance $\gamma$. If $\alpha$ is very small, it might make sense to remove the dependence on $\gamma$ and simply model $\nu_n \sim \InverseGamma(a, b)$.
  3. A diffuse prior for $\gamma$ and a small $\alpha$ could be problematic: mixing issues are likely, as the priors provide too little constraint to learn anything meaningful from the data.
  4. A concentrated prior for $\gamma$ and large $\alpha$ means that you are certain of the value of the global variance, and that the individual variances are close to it. In the extreme case, you might consider just fixing all the $\nu_n$ at the prior mean of $\gamma$ (i.e., a single shared value).

Thus, configurations (1) and (2) above are likely the most commonly used. The short demo below illustrates how $\alpha$ governs the spread of the $\nu_n$ around $\gamma$.
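As a small illustration (reusing the imports above, with a hypothetical fixed $\gamma$), compare the spread of the $\nu_n$ under a small versus a large $\alpha$:

```python
# Spread of nu_n around a fixed gamma for small vs. large alpha.
gamma_fixed = 2.0  # hypothetical "global variance"

for alpha_ in (1.5, 50.0):
    draws = stats.invgamma.rvs(a=alpha_ + 1, scale=alpha_ * gamma_fixed,
                               size=100_000, random_state=rng)
    print(f"alpha={alpha_:5.1f}: mean={draws.mean():.3f}, sd={draws.std():.3f}")

# Both means sit near gamma_fixed, but the sd is gamma / sqrt(alpha - 1):
# large alpha pins nu_n to gamma (configurations 1 and 4), while small
# alpha lets nu_n roam far from it (configurations 2 and 3).
```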