Bayesian Modeling for Heterogeneous Multivariate Data

Arthur Lui
Advisor: Juhee Lee
5 March, 2021

Department of Statistics
UC Santa Cruz

A Bayesian Feature Allocation Model for Identifying Cell Subpopulations Using Cytometry Data

Cytometry by time-of-flight (CyTOF)

Commercialized in 2009
Makes use of time-of-flight spectrometry to accelerate, separate, and identify ions by mass
Enables detection of many parameters (biological, phenotypic, or functional markers) in less time and at a higher resolution
Led to greater understanding of natural killer (NK) cells

Natural Killer Cells

Natural Killer cells play a critical role in cancer immunosurveillance.
NK cell diversity affects antiviral response.
Drs. Thall and Rezvani, at MD Anderson Cancer Center, have conducted clinical trials to study the potential clinical efficacy of umbilical cord blood (UCB) transplantation as a therapy for leukemia.
UCB NK cell therapy has the advantage of a low risk of viral transmission from donor to recipient.
In the trials, leukemia patients received UCB cell transplants, and NK cell surface markers were measured using mass cytometry.

CyTOF Data

Table 1: Cord-blood sample marker expression levels for 6 of 32 NK-cell markers (columns) and 6 of 41474 cells (rows). The last row contains the cutoff values returned by the CyTOF instrument.

Data missing not at random
- Some markers contain up to 85% missing values
Cutoff values are computed after measurement

CyTOF Data

Obtaining cell subpopulations using overly simplistic methods may yield an unreasonably high number of subpopulations.

Objective

Figure 1: Given marker expression samples, identify potential latent NK cell subpopulations and their abundances in each sample. Note the pervasiveness of missing data (black cells).

Existing Methods

Most existing methods use traditional clustering methods (K-means, hierarchical clustering, density-based clustering, nearest-neighbor clustering, etc.)
For high-dimensional cytometry data, prior work has compared existing clustering methods, including FlowSOM, PhenoGraph, Rclusterpp, and flowClust.
Existing methods do not directly model latent subpopulations or quantify model uncertainty

Bayesian Feature Allocation Model for Heterogeneous Cell Populations – Notation

$I$: Number of samples
$J$: Number of markers
$N_i$: Number of observations in sample $i$.
$\tilde{y}_{i,n,j}$: Raw expression level for observation $n$, sample $i$, marker $j$, with $\tilde{y}_{i,n,j} \ge 0$
$c_{i,j}$: Cutoff for marker $j$, sample $i$
$y_{i,n,j}$: Transformed expression levels for observation $n$, sample $i$, marker $j$
\[y_{i,n,j}=\log(\tilde{y}_{i,n,j}/c_{i,j}) \in \mathbb{R}.\]
- $(y_{i,n,j} \gg 0)$ likely corresponds to expression
- $(y_{i,n,j} \ll 0)$ likely corresponds to non-expression
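
As a minimal sketch of the transformation above, the following Python snippet computes $\log(\tilde{y}_{i,n,j}/c_{i,j})$ for a handful of raw values. The function name `transform_expression`, the toy values, and the choice to treat a raw value of 0 as missing (the logarithm is undefined there) are assumptions made for illustration, not part of the original slides.

```python
import math

def transform_expression(raw, cutoff):
    """Return log(raw / cutoff), or None when the value is treated as missing.

    Assumption of this sketch: a raw value of None or 0 is reported as
    missing, because log(0) is undefined.
    """
    if raw is None or raw == 0:
        return None
    return math.log(raw / cutoff)

# Hypothetical raw expression levels for one marker in one sample.
raw_cells = [5.2, 0.4, None, 0.0]
cutoff = 1.3  # hypothetical cutoff c_{i,j}

transformed = [transform_expression(v, cutoff) for v in raw_cells]
# Values above the cutoff map to positive numbers (likely expression);
# values below the cutoff map to negative numbers (likely non-expression).
```

Dividing by the cutoff before taking the log centers each marker at 0, so the sign of the transformed value carries the same rough interpretation for every marker and sample.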

Bayesian Feature Allocation Model for Heterogeneous Cell Populations

$\bm Z$: $(J \times K)$ binary matrix defining the latent subpopulations.
- if $Z_{j,k} = 1$, then marker $j$ is expressed in subpopulation $k$
- if $Z_{j,k} = 0$, then marker $j$ is not expressed in subpopulation $k$
$K$ is a sufficiently large constant
$\lambda_{i,n} \in \{1,…,K\}$: The latent subpopulation of observation $n$, sample $i$
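
To make the roles of $\bm Z$ and $\lambda_{i,n}$ concrete, here is a small Python sketch with a hypothetical $J = 3$, $K = 2$ feature allocation matrix. The matrix entries, the labels, and the helper `expressed_markers` are invented for illustration only.

```python
# Hypothetical toy example: J = 3 markers (rows), K = 2 subpopulations (columns).
Z = [
    [1, 0],  # marker 1: expressed in subpopulation 1 only
    [0, 0],  # marker 2: expressed in neither subpopulation
    [1, 1],  # marker 3: expressed in both subpopulations
]

def expressed_markers(Z, k):
    """Return 0-based indices of markers expressed in subpopulation k (1-based)."""
    return [j for j, row in enumerate(Z) if row[k - 1] == 1]

# Hypothetical labels lambda_{i,n} for four cells in one sample.
lam = [1, 2, 2, 1]

# Each cell inherits the binary expression profile (a column of Z)
# of the subpopulation it belongs to.
profiles = [[row[k - 1] for row in Z] for k in lam]
```

Each column of `Z` is the expression signature of one subpopulation, and a cell's label simply selects which column applies to it.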

Sampling Distribution

\[\begin{aligned} y_{i,n,j} \mid \bm\eta_{i,j}, \bm\mu^\star, \sigma^2_i, \bm Z, \lambda_{i,n}=k \overset{\text{ind}}{\sim} \begin{cases} F_{0,i,j}, &\text{if }Z_{j,k}=0,\\ F_{1,i,j}, &\text{if }Z_{j,k}=1. \end{cases} \end{aligned}\]

$F_{0,i,j} = \sum_{\ell=1}^{L^0} \eta^0_{i,j,\ell} \cdot \text{Normal}(\mu^\star_{0,\ell}, \sigma^2_i)$
$F_{1,i,j} = \sum_{\ell=1}^{L^1} \eta^1_{i,j,\ell} \cdot \text{Normal}(\mu^\star_{1,\ell}, \sigma^2_i)$

Figure 2: Kernel density estimate of samples from $F_0$ (blue) and $F_1$ (red).
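
The sampling distribution can be simulated with a short Python sketch: pick a mixture component with probability given by the weights, then draw from the corresponding normal. All numeric values below (weights, component means, $\sigma_i$) and the function names are hypothetical; placing the $F_0$ means below 0 and the $F_1$ means above 0 is an assumption suggested by the interpretation of the transformed data, not a statement of the fitted model.

```python
import random

def sample_mixture(weights, means, sigma, rng):
    """Draw one value from sum_l weights[l] * Normal(means[l], sigma^2)."""
    component = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[component], sigma)

def sample_y(z_jk, eta0, mu0, eta1, mu1, sigma, rng):
    """Draw y from F_1 if the marker is expressed (z_jk == 1), else from F_0."""
    if z_jk == 1:
        return sample_mixture(eta1, mu1, sigma, rng)
    return sample_mixture(eta0, mu0, sigma, rng)

rng = random.Random(0)

# Hypothetical mixture parameters with L^0 = L^1 = 2 components.
eta0, mu0 = [0.5, 0.5], [-3.0, -1.5]  # assumed: F_0 components below 0
eta1, mu1 = [0.5, 0.5], [1.0, 2.5]    # assumed: F_1 components above 0
sigma = 0.5

draws_f0 = [sample_y(0, eta0, mu0, eta1, mu1, sigma, rng) for _ in range(2000)]
draws_f1 = [sample_y(1, eta0, mu0, eta1, mu1, sigma, rng) for _ in range(2000)]
```

Kernel density estimates of `draws_f0` and `draws_f1` would give two curves of the kind described in the Figure 2 caption, one per mixture.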


A Bayesian Model for Identifying Distinct Features that Define Cell Subpopulations from Cytometry Data

A Bayesian Differential Distribution Approach for Zero-inflated Data with Applications to Cytometry Data

References

Backup slides for FAM.

Backup slides for rFAM.

Backup slides for zinf.