luiarthur.github.io/dissertation-web-slides

Bayesian Modeling for Heterogeneous Multivariate Data



Arthur Lui
Advisor: Juhee Lee
5 March, 2021

Department of Statistics
UC Santa Cruz

A Bayesian Feature Allocation Model for Identifying Cell Subpopulations Using Cytometry Data


Cytometry by time-of-flight (CyTOF)

  • Commercialized in 2009

  • Makes use of time-of-flight spectrometry to accelerate, separate, and identify ions by mass

  • Enables detection of many parameters (biological, phenotypic, or functional markers) in less time and at a higher resolution

  • Led to greater understanding of natural killer (NK) cells

Natural Killer Cells

  • Natural Killer cells play a critical role in cancer immunosurveillance.

  • NK cell diversity affects antiviral response.

  • Drs. Thall and Rezvani, at MD Anderson Cancer Center, have conducted clinical trials to study the potential clinical efficacy of umbilical cord blood (UCB) transplantation as a therapy for leukemia.

  • UCB NK cell therapy has the advantage of a low risk of viral transmission from donor to recipient.

  • In the trials, leukemia patients received UCB cell transplants, and NK cell surface markers were measured using mass cytometry.

CyTOF Data

Table 1: Cord-blood sample marker expression levels for 6 of 32 NK-cell markers (columns) and 6 of 41474 cells (rows). The last row contains the cutoff values returned by the CyTOF instrument.
  • Data missing not at random
    • Some markers contain up to 85% missing values
  • Cutoff values are computed after measurement

CyTOF Data

Table 2: Cell subpopulations (rows).

Obtaining cell subpopulations using overly simplistic methods may yield an unreasonably high number of subpopulations.
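The point above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the slide does not say which simplistic method is meant, so the code assumes one that dichotomizes each transformed marker value at zero (that is, at the cutoff) and treats every distinct binary pattern as a subpopulation. The two true profiles, the noise level, and the sample size are all hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical truth: only two subpopulations over 10 markers.
true_profiles = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
]

def simulate_cell(profile, rng):
    """Noisy transformed expression: mean +1 if expressed, -1 otherwise."""
    return [rng.gauss(1.0 if z == 1 else -1.0, 1.0) for z in profile]

# Simulate 2000 cells, each drawn from one of the two true profiles.
cells = [simulate_cell(rng.choice(true_profiles), rng) for _ in range(2000)]

# Assumed simplistic method: threshold each marker at 0 and treat every
# distinct binary pattern as its own subpopulation.
patterns = {tuple(int(y > 0) for y in cell) for cell in cells}
n_patterns = len(patterns)

print("true subpopulations:", len(true_profiles))
print("patterns found by thresholding:", n_patterns)
```

Because each marker can be misclassified independently, the number of distinct patterns can grow toward $2^J$ even when only a few true subpopulations exist.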

Objective

Figure 1: Given marker expression samples, identify potential latent NK cell subpopulations and their abundances in each sample. Note the pervasiveness of missing data (black cells).

Existing Methods


  • Most existing methods use traditional clustering methods (K-means, hierarchical clustering, density-based clustering, nearest-neighbor clustering, etc.)

  • Existing clustering methods for high-dimensional cytometry data, including FlowSOM, PhenoGraph, Rclusterpp, and flowClust, have been compared in prior work.

  • Existing methods do not directly model latent subpopulations or quantify model uncertainty

Bayesian Feature Allocation Model for Heterogeneous Cell Populations – Notation

  • $I$: Number of samples
  • $J$: Number of markers
  • $N_i$: Number of observations in sample $i$.
  • $\tilde{y}_{i,n,j}$: Raw expression levels for observation $n$, sample $i$, marker $j$ ($\tilde{y}_{i,n,j} \ge 0$)
  • $c_{i,j}$: Cutoff for marker $j$, sample $i$
  • $y_{i,n,j}$: Transformed expression levels for observation $n$, sample $i$, marker $j$
    \[y_{i,n,j}=\log(\tilde{y}_{i,n,j}/c_{i,j}) \in \mathbb{R}.\]
    • $(y_{i,n,j} \gg 0)$ likely corresponds to expression
    • $(y_{i,n,j} \ll 0)$ likely corresponds to non-expression
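As a minimal sketch of the transformation above, the following computes $y = \log(\tilde{y}/c)$ for a few raw values. The cutoff and raw values are hypothetical, and the function name `transform` is an illustrative choice. The sketch assumes strictly positive inputs, since the logarithm is not finite at zero.

```python
import math

def transform(raw, cutoff):
    """Log-ratio of a raw expression level to its sample/marker cutoff."""
    if raw <= 0 or cutoff <= 0:
        # Assumption of this sketch: only strictly positive inputs are handled.
        raise ValueError("this sketch assumes strictly positive inputs")
    return math.log(raw / cutoff)

cutoff = 2.0                   # hypothetical cutoff c_{i,j}
raw_levels = [0.5, 2.0, 16.0]  # hypothetical raw expression levels
transformed = [transform(r, cutoff) for r in raw_levels]

for r, y in zip(raw_levels, transformed):
    print(f"raw={r:5.1f}  transformed={y:+.3f}")
```

A raw value equal to the cutoff maps to 0, values above the cutoff map to positive numbers, and values below it map to negative numbers.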

Bayesian Feature Allocation Model for Heterogeneous Cell Populations

  • $\bm Z$: $(J \times K)$ binary matrix defining the latent subpopulations.
    • if $Z_{j,k} = 1$, then marker $j$ is expressed in subpopulation $k$
    • if $Z_{j,k} = 0$, then marker $j$ is not expressed in subpopulation $k$
  • $K$ is a sufficiently large constant

  • $\lambda_{i,n} \in \{1,…,K\}$: The latent subpopulation of observation $n$, sample $i$
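To make the roles of $\bm Z$ and $\lambda_{i,n}$ concrete, here is a small sketch with a hypothetical $3 \times 2$ matrix and hypothetical labels for four cells. The helper name `expressed_markers` is illustrative and not part of the model.

```python
# Hypothetical Z with J = 3 markers (rows) and K = 2 subpopulations (columns).
Z = [
    [1, 0],
    [1, 1],
    [0, 1],
]

def expressed_markers(Z, k):
    """Return the 1-based markers j with Z[j][k] = 1, for a 1-based k."""
    return [j + 1 for j, row in enumerate(Z) if row[k - 1] == 1]

# Hypothetical labels lambda_{i,n} for four cells in a single sample.
lam = [1, 2, 2, 1]

# Each cell inherits the marker profile of its latent subpopulation.
profiles = [expressed_markers(Z, k) for k in lam]

# Empirical abundance of each subpopulation in this sample.
K = len(Z[0])
abundance = [lam.count(k) / len(lam) for k in range(1, K + 1)]

print(profiles)
print(abundance)
```

Here subpopulation 1 expresses markers 1 and 2, while subpopulation 2 expresses markers 2 and 3, so a marker can be shared across subpopulations.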

Sampling Distribution

\[\begin{aligned} y_{i,n,j} \mid \bm\eta_{i,j}, \bm\mu^\star, \sigma^2_i, \bm Z, \lambda_{i,n}=k \ind \begin{cases} F_{0,i,j}, &\text{if } Z_{j,k}=0,\\ F_{1,i,j}, &\text{if } Z_{j,k}=1. \end{cases} \end{aligned}\]
  • $F_{0,i,j} = \sum_{\ell=1}^{L^0} \eta^0_{i,j,\ell} \cdot \text{Normal}(\mu^\star_{0,\ell}, \sigma^2_i)$
  • $F_{1,i,j} = \sum_{\ell=1}^{L^1} \eta^1_{i,j,\ell} \cdot \text{Normal}(\mu^\star_{1,\ell}, \sigma^2_i)$
Figure 2: Kernel density estimate of samples from $F_0$ (blue) and $F_1$ (red).
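The sampling distribution above can be sketched as a simulation: given $Z_{j,k}$, draw from the corresponding normal mixture. All numeric values are hypothetical. In particular, the sketch assumes the $F_0$ component means are negative and the $F_1$ component means are positive, in line with the interpretation of the transformed data but not stated on this slide.

```python
import math
import random

def sample_mixture(weights, means, sigma2, rng):
    """Draw once from sum_l weights[l] * Normal(means[l], sigma2)."""
    component = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # random.gauss takes a standard deviation, so convert the variance.
    return rng.gauss(means[component], math.sqrt(sigma2))

def sample_y(z_jk, eta0, mu0, eta1, mu1, sigma2, rng):
    """Draw y from F_1 if the marker is expressed (z_jk = 1), else from F_0."""
    if z_jk == 1:
        return sample_mixture(eta1, mu1, sigma2, rng)
    return sample_mixture(eta0, mu0, sigma2, rng)

rng = random.Random(0)                # fixed seed for reproducibility
eta0, mu0 = [0.6, 0.4], [-3.0, -1.0]  # hypothetical F_0 weights and means
eta1, mu1 = [0.5, 0.5], [1.0, 2.5]    # hypothetical F_1 weights and means
sigma2 = 0.25                         # hypothetical sigma^2_i

draws0 = [sample_y(0, eta0, mu0, eta1, mu1, sigma2, rng) for _ in range(2000)]
draws1 = [sample_y(1, eta0, mu0, eta1, mu1, sigma2, rng) for _ in range(2000)]

mean0 = sum(draws0) / len(draws0)
mean1 = sum(draws1) / len(draws1)
print(f"mean under F_0: {mean0:.2f}, mean under F_1: {mean1:.2f}")
```

Both mixtures share the sample-level variance $\sigma^2_i$ and differ only in their weights and component means, which lets each distribution take a flexible shape on its side of zero.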


A Bayesian Model for Identifying Distinct Features that Define Cell Subpopulations from Cytometry Data





A Bayesian Differential Distribution Approach for Zero-inflated Data with Applications to Cytometry Data


