24 Sep, 2015
I was still having a hard time identifying the differences between a hierarchical Bayesian model and a BNP (Bayesian nonparametric) model. So, I decided to write out the models and think about the differences. Here are two simple models.
\(\begin{array}{rclcl} y_i &|& \theta_i &\sim& F(\theta_i) \\\\ \theta_i &|& G &\sim& G \\\\ && G &\sim& \text{DP}(G_0,\alpha) \\\\ \end{array}\)
\(\begin{array}{rclcl} y_i &|& \theta_i &\sim& F(\theta_i) \\\\ && \theta_i &\sim& G \\\\ \end{array}\)
In the BNP model, a possible choice for $F(\theta)$ may be N($\theta,0.1$). $G$ can then be thought of as the distribution for $\theta$. $G$ could be any distribution that fits the support of $\theta$. $G$ could be N(5,1); who knows?! We don’t know. So, we put a prior on $G$: the DP with “mean” $G_0$, which could be, say, a N(3,2), and mass parameter $\alpha$. As for the funny notation in lines 2-3, $G$ is unknown, so it appears several times: the first time as the given, the second time as the prior distribution for $\theta$, and the third time as the distribution being given a prior. Heuristically, lines 2-3 say that the distribution of the distribution for $\theta$ is centered at the distribution $G_0$. To be clear, we are not modeling the mean of $\theta$, or any other parameter associated with $\theta$. We are modeling the distribution of $\theta$. And if you think about it, this is the notation that makes the most sense. How else could you correctly express the idea of modeling the distribution itself? There is no more concise way.
In the hierarchical model, once again, F($\theta$) could be N($\theta$,0.1), but $G$ is where the model specification stops. $G$ could be N(3,2). And that would be the end. Heuristically, $\theta$ could be any value, but we believe it is centered at 3. End of story.
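For concreteness, the hierarchical model can be simulated in a couple of lines. The constants follow the text ($G$ = N(3,2), F($\theta$) = N($\theta$,0.1)); I am assuming the 2 is a standard deviation and the 0.1 a variance, since the parameterization isn't stated.

```r
# Generative draw from the hierarchical model:
#   theta_i ~ G = N(3, sd=2),  y_i | theta_i ~ N(theta_i, var=0.1)
set.seed(1)
n <- 5
theta <- rnorm(n, mean=3, sd=2)          # theta_i ~ G (G is fixed and known)
y <- rnorm(n, mean=theta, sd=sqrt(0.1))  # y_i | theta_i ~ F(theta_i)
```

Note that $G$ never changes here; that is the key contrast with the BNP model, where $G$ itself is random.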
First, some preliminaries. Recall that if we have the model \(\begin{array}{rclcl} y &|& \theta &\sim& \text{Normal}(\theta,1)\\\\ & & \theta &\sim& \text{Normal}(m,s^2)\\\\ \end{array}\) or, more generally, \(\begin{array}{rclcl} y &|& \theta &\sim& F(\theta)\\\\ & & \theta &\sim& G \end{array}\) then the posterior is $ [\theta|y] \propto [y|\theta][\theta] $. The posterior predictive can be computed as $[y^* | y] = \int [ y^*|\theta^* ] [\theta^*|y]\,d\theta^*$, where $\theta^*$ denotes a draw from the posterior of $\theta$. Or, using density notation, $f(y^*|y) = \displaystyle \int f(y^*|\theta^*) g(\theta^*|y) d\theta^* = \displaystyle \int f(y^*|\theta^*) dG(\theta^*|y)$. When we see an integral of this form, we have the following interpretation:
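As a quick numerical check of the posterior predictive formula, here is a Monte Carlo sketch for the conjugate Normal model above. The particular values of $m$, $s$, and $y$ are my own, and the posterior update is the standard precision-weighted one.

```r
# Posterior predictive by Monte Carlo for y | theta ~ N(theta, 1), theta ~ N(m, s^2).
set.seed(42)
m <- 0; s <- 2; y <- 1.5                    # prior mean/sd and one observation
s1.sq <- 1 / (1/s^2 + 1)                    # posterior variance of theta | y
m1 <- s1.sq * (m/s^2 + y/1)                 # posterior mean of theta | y
theta.star <- rnorm(1e5, m1, sqrt(s1.sq))   # theta* ~ [theta | y]
y.star <- rnorm(1e5, theta.star, 1)         # y* | theta* ~ N(theta*, 1)
mean(y.star)  # close to m1
var(y.star)   # close to s1.sq + 1
```

Drawing $\theta^*$ first and then $y^*$ given $\theta^*$ is exactly the integral $\int f(y^*|\theta^*)\,dG(\theta^*|y)$ done by simulation.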
For our BNP model,
\[\begin{array}{rclcl} y_i &|& \theta_i &\sim& F(\theta_i) \\\\ \theta_i &|& G &\sim& G \\\\ && G &\sim& \text{DP}(G_0,\alpha) \\\\ \end{array}\]we can marginalize $\theta$ such that
\[[y] = \displaystyle\int [y | \theta ] [\theta] d\theta = \displaystyle\int [ y | \theta ] dG(\theta).\]In case it’s not obvious, note that $[\theta]\,d\theta = g(\theta)\,d\theta = dG(\theta)$, where $g$ is the density of $\theta$ (here a probability mass function, since draws from a DP are discrete), $G$ is the corresponding CDF, and $G \sim \text{DP}(G_0,\alpha)$. Now, the next step is to learn how to sample from the DP to get distributions for $G$. Then, all we need to do for a basic Gibbs sampler is to draw a distribution $G$ from the DP, then draw a $\theta$ from $G$ to fit our model.
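One standard way to draw a (finite approximation of a) distribution $G$ from a DP is Sethuraman’s stick-breaking construction, truncated at $K$ atoms. This sketch is my own and is separate from the Polya urn code below; I use $G_0$ = N(0,1) for illustration.

```r
# Truncated stick-breaking draw of G ~ DP(G0, alpha):
#   v_k ~ Beta(1, alpha),  w_k = v_k * prod_{j<k} (1 - v_j),  atoms ~ G0.
rdp <- function(K=1000, alpha=1, rG0=function(n) rnorm(n)) {
  v <- rbeta(K, 1, alpha)
  w <- v * cumprod(c(1, 1 - v[-K]))  # weights; sum is (almost exactly) 1
  list(atoms=rG0(K), w=w)
}
set.seed(7)
G <- rdp()
# G is discrete, so drawing theta from G means sampling its atoms by weight:
theta <- sample(G$atoms, 10, replace=TRUE, prob=G$w)
```

A draw of `theta` from the realized `G` is exactly the second line of the BNP model, $\theta_i \,|\, G \sim G$.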
It is valuable to practice drawing from the DP. Draws from the DP are discrete, and it is impossible to take an infinite number of draws, but we can use a large number (here, `N = 100` draws per realization) and see the behavior of the DP. I have provided some R code. Change `N`, `a`, and `rG` to study the behavior.
source("cdf.R")  # provides plot.cdf()

# Polya urn construction of N draws from a DP(G0, a),
# where rG draws from the base measure G0:
dp <- function(rG, a, N=100, printProg=F) {
  x <- rep(0, N)
  x[1] <- rG(1)
  for (i in 2:N) {
    nx <- table(x[1:(i-1)])
    xs <- c(rG(1), names(nx))
    # Urn weights at step i: a for a fresh draw from G0, and n_j for
    # each existing atom; dividing by (a+i-1) gives proper probabilities
    # (sample() would renormalize anyway, but this is the Polya urn form).
    probs <- c(a, nx) / (a + i - 1)
    x[i] <- as.numeric(sample(xs, 1, prob=probs))
    if (printProg) cat("\r Progress: ", i, "/", N)
  }
  x
}

make.plot <- function(a, N, B, rG, pG, col.a=.2, plot.main) {
  X <- matrix(0, B, N)
  for (i in 1:B) {
    X[i,] <- dp(rG, a, N)
    plot.cdf(X[i,],
             discrete=F, print=F, add=ifelse(i==1, F, T),
             cex=.5, pch=20,
             col=ifelse(i==B, "black", rgb(.5,.5,.5,col.a)),
             ylim=c(0,1), xlim=c(-3,5),
             ylab="Fn(x)", xlab="x", main=plot.main)
    cat("\r Progress:", i, "/", B)
  }
  curve(pG, add=T, lwd=2, col="red")  # the base measure G0
  # Pointwise mean of the sorted draws approximates E[G]:
  X.sort <- t(apply(X, 1, sort))
  X.mean <- apply(X.sort, 2, mean)
  plot.cdf(X.mean, discrete=F, print=F, add=T, col="blue", lwd=2)
  legend("bottomright",
         legend=c("Random draws from the DP",
                  "A particular draw from the DP",
                  "G0", "E[G]"),
         bty="n", col=c("darkgrey","black","red","blue"), lwd=3)
}

#svg("../../../../assets/ams241/01/plots/dp_compare.svg",13,7);
par(mfrow=c(1,3))
rG <- function(n) rnorm(n, 0, 1)  # draws from G0 = N(0,1)
pG <- function(x) pnorm(x, 0, 1)  # CDF of G0
make.plot(a=1,   N=100, B=500, rG, pG, col.a=.1, plot.main="a = 1")
make.plot(a=10,  N=100, B=500, rG, pG, col.a=.1, plot.main="a = 10")
make.plot(a=100, N=100, B=500, rG, pG, col.a=.1, plot.main="a = 100")
par(mfrow=c(1,1))
#dev.off();