24 Sep, 2015
I was still having a hard time identifying the differences between a hierarchical Bayesian model and a BNP (Bayesian nonparametric) model. So, I decided to write out the models and think about the differences. Here are two simple models.
\(\begin{array}{rclcl} y_i &|& \theta_i &\sim& F(\theta_i) \\\\ \theta_i &|& G &\sim& G \\\\ && G &\sim& \text{DP}(G_0,\alpha) \\\\ \end{array}\)
\(\begin{array}{rclcl} y_i &|& \theta_i &\sim& F(\theta_i) \\\\ && \theta_i &\sim& G \\\\ \end{array}\)
In the BNP model, a possible choice for $F(\theta)$ may be N($\theta,0.1$). $G$ can then be thought of as the distribution for $\theta$. $G$ could be any distribution that fits the support of $\theta$. $G$ could be N(5,1); who knows?! We don’t know. So, we put a prior on $G$: the DP with “mean” $G_0$, which could be, say, a N(3,2), and mass parameter $\alpha$. As for the funny notation in lines 2-3, $G$ is unknown, so it appears several times: the first time as the given, the second time as the prior distribution for $\theta$, and the third time as the distribution being given a prior. Heuristically, lines 2-3 say that the distribution of the distribution for $\theta$ is centered at the distribution $G_0$. To be clear, we are not modeling the mean of $\theta$, or any other parameter associated with $\theta$. We are modeling the distribution of $\theta$. And if you think about it, this is the notation that makes the most sense. How else could you correctly express the idea of modeling the distribution itself? There is no more concise way.
In the hierarchical model, once again, F($\theta$) could be N($\theta$,0.1), but $G$ is where the model specification stops. $G$ could be N(3,2). And that would be the end. Heuristically, $\theta$ could be any value, but we believe it is centered at 3. End of story.
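For concreteness, the hierarchical model can be simulated in a couple of lines. The constants follow the text ($G$ = N(3,2), F($\theta$) = N($\theta$,0.1)); I am assuming the 2 is a standard deviation and the 0.1 a variance, since the parameterization isn't stated.

```r
# Generative draw from the hierarchical model:
#   theta_i ~ G = N(3, sd=2),  y_i | theta_i ~ N(theta_i, var=0.1)
set.seed(1)
n <- 5
theta <- rnorm(n, mean=3, sd=2)          # theta_i ~ G (G is fixed and known)
y <- rnorm(n, mean=theta, sd=sqrt(0.1))  # y_i | theta_i ~ F(theta_i)
```

Note that $G$ never changes here; that is the key contrast with the BNP model, where $G$ itself is random.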
First, some preliminaries. Recall that if we have the model \(\begin{array}{rclcl} y &|& \theta &\sim& \text{Normal}(\theta,1)\\\\ & & \theta &\sim& \text{Normal}(m,s^2)\\\\ \end{array}\) or, more generally, \(\begin{array}{rclcl} y &|& \theta &\sim& F(\theta)\\\\ & & \theta &\sim& G \end{array}\) then the posterior is $ [\theta|y] \propto [y|\theta][\theta] $. The posterior predictive can be computed as $[y^* | y] = \int [ y^*|\theta^* ] [\theta^*|y]\,d\theta^*$, where $\theta^*$ denotes a draw from the posterior of $\theta$. Or, using density notation, $f(y^*|y) = \displaystyle \int f(y^*|\theta^*) g(\theta^*|y) d\theta^* = \displaystyle \int f(y^*|\theta^*) dG(\theta^*|y)$. When we see an integral of this form, we have the following interpretation:
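As a quick numerical check of the posterior predictive formula, here is a Monte Carlo sketch for the conjugate Normal model above. The particular values of $m$, $s$, and $y$ are my own, and the posterior update is the standard precision-weighted one.

```r
# Posterior predictive by Monte Carlo for y | theta ~ N(theta, 1), theta ~ N(m, s^2).
set.seed(42)
m <- 0; s <- 2; y <- 1.5                    # prior mean/sd and one observation
s1.sq <- 1 / (1/s^2 + 1)                    # posterior variance of theta | y
m1 <- s1.sq * (m/s^2 + y/1)                 # posterior mean of theta | y
theta.star <- rnorm(1e5, m1, sqrt(s1.sq))   # theta* ~ [theta | y]
y.star <- rnorm(1e5, theta.star, 1)         # y* | theta* ~ N(theta*, 1)
mean(y.star)  # close to m1
var(y.star)   # close to s1.sq + 1
```

Drawing $\theta^*$ first and then $y^*$ given $\theta^*$ is exactly the integral $\int f(y^*|\theta^*)\,dG(\theta^*|y)$ done by simulation.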
For our BNP model,
\[\begin{array}{rclcl} y_i &|& \theta_i &\sim& F(\theta_i) \\\\ \theta_i &|& G &\sim& G \\\\ && G &\sim& \text{DP}(G_0,\alpha) \\\\ \end{array}\]we can marginalize $\theta$ such that
\[[y] = \displaystyle\int [y | \theta ] [\theta] d\theta = \displaystyle\int [ y | \theta ] dG(\theta).\]In case it’s not obvious, note that $[\theta]\,d\theta = g(\theta)\,d\theta = dG(\theta)$, where $g$ is the density of $\theta$ (here a probability mass function, since draws from a DP are discrete), $G$ is the corresponding CDF, and $G \sim \text{DP}(G_0,\alpha)$. Now, the next step is to learn how to sample from the DP to get distributions for $G$. Then, all we need to do for a basic Gibbs sampler is to draw a distribution $G$ from the DP, then draw a $\theta$ from $G$ to fit our model.
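One standard way to draw a (finite approximation of a) distribution $G$ from a DP is Sethuraman’s stick-breaking construction, truncated at $K$ atoms. This sketch is my own and is separate from the Polya urn code below; I use $G_0$ = N(0,1) for illustration.

```r
# Truncated stick-breaking draw of G ~ DP(G0, alpha):
#   v_k ~ Beta(1, alpha),  w_k = v_k * prod_{j<k} (1 - v_j),  atoms ~ G0.
rdp <- function(K=1000, alpha=1, rG0=function(n) rnorm(n)) {
  v <- rbeta(K, 1, alpha)
  w <- v * cumprod(c(1, 1 - v[-K]))  # weights; sum is (almost exactly) 1
  list(atoms=rG0(K), w=w)
}
set.seed(7)
G <- rdp()
# G is discrete, so drawing theta from G means sampling its atoms by weight:
theta <- sample(G$atoms, 10, replace=TRUE, prob=G$w)
```

A draw of `theta` from the realized `G` is exactly the second line of the BNP model, $\theta_i \,|\, G \sim G$.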
It is valuable to practice drawing from the DP. Draws from the DP are discrete, and it is impossible to take an infinite number of draws, but we can use a large number (here, `N = 100` draws per realization) and see the behavior of the DP. I have provided some R code. Change `N`, `a`, and `rG` to study the behavior.
source("cdf.R")  # provides plot.cdf()

# Polya urn construction of N draws from a DP(G0, a),
# where rG draws from the base measure G0:
dp <- function(rG, a, N=100, printProg=F) {
  x <- rep(0, N)
  x[1] <- rG(1)
  for (i in 2:N) {
    nx <- table(x[1:(i-1)])
    xs <- c(rG(1), names(nx))
    # Urn weights at step i: a for a fresh draw from G0, and n_j for
    # each existing atom; dividing by (a+i-1) gives proper probabilities
    # (sample() would renormalize anyway, but this is the Polya urn form).
    probs <- c(a, nx) / (a + i - 1)
    x[i] <- as.numeric(sample(xs, 1, prob=probs))
    if (printProg) cat("\r Progress: ", i, "/", N)
  }
  x
}

make.plot <- function(a, N, B, rG, pG, col.a=.2, plot.main) {
  X <- matrix(0, B, N)
  for (i in 1:B) {
    X[i,] <- dp(rG, a, N)
    plot.cdf(X[i,],
             discrete=F, print=F, add=ifelse(i==1, F, T),
             cex=.5, pch=20,
             col=ifelse(i==B, "black", rgb(.5,.5,.5,col.a)),
             ylim=c(0,1), xlim=c(-3,5),
             ylab="Fn(x)", xlab="x", main=plot.main)
    cat("\r Progress:", i, "/", B)
  }
  curve(pG, add=T, lwd=2, col="red")  # the base measure G0
  # Pointwise mean of the sorted draws approximates E[G]:
  X.sort <- t(apply(X, 1, sort))
  X.mean <- apply(X.sort, 2, mean)
  plot.cdf(X.mean, discrete=F, print=F, add=T, col="blue", lwd=2)
  legend("bottomright",
         legend=c("Random draws from the DP",
                  "A particular draw from the DP",
                  "G0", "E[G]"),
         bty="n", col=c("darkgrey","black","red","blue"), lwd=3)
}

#svg("../../../../assets/ams241/01/plots/dp_compare.svg",13,7);
par(mfrow=c(1,3))
rG <- function(n) rnorm(n, 0, 1)  # draws from G0 = N(0,1)
pG <- function(x) pnorm(x, 0, 1)  # CDF of G0
make.plot(a=1,   N=100, B=500, rG, pG, col.a=.1, plot.main="a = 1")
make.plot(a=10,  N=100, B=500, rG, pG, col.a=.1, plot.main="a = 10")
make.plot(a=100, N=100, B=500, rG, pG, col.a=.1, plot.main="a = 100")
par(mfrow=c(1,1))
#dev.off();