A deep generative model for capturing cell to phenotype relationships
Ethan Weinberger, Patrick Yu, Su-In Lee
DOI: https://doi.org/10.1101/2024.08.07.606396
2024-08-09
Abstract: Single-cell omics has proven to be a powerful instrument for exploring cellular diversity. With advances in sequencing protocols, single-cell data are now routinely collected from large-scale cohorts comprising samples from hundreds of donors, with the goal of uncovering the molecular bases of higher-level donor phenotypes of interest. For example, to better understand the mechanisms behind Alzheimer's disease (AD), recent studies with up to hundreds of samples have investigated the relationships between single-cell omics measurements and donors' neuropathological phenotypes (e.g. Braak staging). To ensure the robustness of such findings, it may be desirable to aggregate data from multiple distinct donor cohorts. Unfortunately, doing so is not always straightforward, as different cohorts may be equipped with different sets of phenotype labels. Continuing the Alzheimer's example, recent AD study cohorts have provided various subsets of neuropathological phenotypes, cognitive testing results, and APOE genotype. Thus, it is desirable to be able to infer any missing phenotype labels so that all available cell-level data can be used in the study of a given phenotype of interest. Moreover, beyond simply imputing missing phenotype information, it is often of interest to understand which groups of cells and/or molecular features may be most predictive of a given phenotype. As such, there is a pressing need for computational methods that can connect cell-level measurements with donor-level labels. However, accomplishing this task is not straightforward. While a rich literature exists on learning meaningful low-dimensional representations of cells and on inferring corresponding cell-level labels (e.g. cell type), the donor-level prediction task introduces substantial additional complexity.
For example, different numbers of cells may be recovered from each donor, and thus our prediction model must be able to handle arbitrary numbers of cells as input. Moreover, ideally our model would not require any prior knowledge beyond the cell-level measurements, such as the importance of different cell types for a given prediction task. To resolve these issues, we propose milVI (multiple instance learning variational inference), a deep generative modeling framework that explicitly accounts for donor-level phenotypes and enables inference of missing phenotype labels post-training. To handle varying numbers of cells per donor when inferring phenotype labels, milVI leverages recent advances in multiple instance learning. We validated milVI by applying it to impute held-out Braak staging information from an Alzheimer's disease study cohort from Mathys et al., and we found that our method achieved lower error on this task than naive imputation methods.
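To illustrate how a multiple instance learning model can accept a variable number of cells per donor, the following is a minimal sketch of attention-based pooling, one common technique from the multiple instance learning literature. The abstract does not state which aggregation milVI actually uses, so this should be read as an assumption for illustration only. The function name `attention_pool`, the parameters `V` and `w`, the embedding sizes, and the randomly generated donors are all hypothetical.

```python
import numpy as np


def attention_pool(cell_embeddings, V, w):
    """Aggregate a (n_cells, d) bag of cell embeddings into one (d,) donor embedding.

    Hypothetical sketch: V has shape (d, h) and w has shape (h,); in a real
    model both would be learned jointly with the rest of the network.
    """
    scores = np.tanh(cell_embeddings @ V) @ w      # one attention score per cell
    scores = scores - scores.max()                 # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the cells in the bag
    donor_embedding = weights @ cell_embeddings    # attention-weighted average of cells
    return donor_embedding, weights


rng = np.random.default_rng(0)
d, h = 8, 4                      # assumed embedding and attention hidden sizes
V = rng.normal(size=(d, h))      # random stand-ins for learned parameters
w = rng.normal(size=h)

# Two hypothetical donors with different numbers of recovered cells.
donor_a = rng.normal(size=(50, d))
donor_b = rng.normal(size=(300, d))

emb_a, weights_a = attention_pool(donor_a, V, w)
emb_b, weights_b = attention_pool(donor_b, V, w)
print(emb_a.shape, emb_b.shape)  # both donors map to a vector of the same size
```

Because the pooled vector has a fixed size regardless of how many cells a donor contributes, a downstream phenotype predictor can be applied to every donor. The per-cell weights also give a rough indication of which cells drive the prediction, which relates to the abstract's interest in identifying predictive groups of cells.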
Bioinformatics