Abstract: Single-cell omics has proven to be a powerful instrument for exploring cellular diversity. With advances in sequencing protocols, single-cell data are now routinely collected from large-scale cohorts comprising samples from hundreds of donors, with the goal of uncovering the molecular bases of higher-level donor phenotypes of interest. For example, to better understand the mechanisms behind Alzheimer's disease (AD), recent studies with up to hundreds of samples have investigated the relationships between single-cell omics measurements and donors' neuropathological phenotypes (e.g. Braak staging). To ensure the robustness of such findings, it may be desirable to aggregate data from multiple distinct donor cohorts. Unfortunately, doing so is not always straightforward, as different cohorts may be equipped with different sets of phenotype labels. Continuing the Alzheimer's example, recent AD study cohorts have provided various subsets of neuropathological phenotypes, cognitive testing results, and APOE genotype. Thus, it is desirable to infer any missing phenotype labels so that all available cell-level data can be used in the study of a given phenotype of interest. Moreover, beyond simply imputing missing phenotype information, it is often of interest to understand which groups of cells and/or molecular features are most predictive of a given phenotype. As such, there is a pressing need for computational methods that can connect cell-level measurements with donor-level labels. However, accomplishing this task is not straightforward. While a rich literature exists on learning meaningful low-dimensional representations of cells and on inferring corresponding cell-level labels (e.g. cell type), the donor-level prediction task introduces substantial additional complexity.
For example, different numbers of cells may be recovered from each donor, and thus our prediction model must be able to handle an arbitrary number of cells as input. Moreover, ideally our model would not require any prior knowledge beyond our cell-level measurements, such as the importance of different cell types for a given prediction task. To resolve these issues, here we propose milVI (multiple instance learning variational inference), a deep generative modeling framework that explicitly accounts for donor-level phenotypes and enables inference of missing phenotype labels post-training. To handle varying numbers of cells per donor when inferring phenotype labels, milVI leverages recent advances in multiple instance learning. We validated milVI by applying it to impute held-out Braak staging information from an Alzheimer's disease study cohort from Mathys et al., and we found that our method achieved lower error on this task than naive imputation methods.
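
To illustrate how a multiple instance learning aggregator can accept a different number of cells per donor, the following is a minimal sketch of attention-based pooling over cell embeddings. It is not milVI's actual architecture, which the abstract does not specify: the function name `attention_pool`, the parameters `V` and `w`, the embedding dimensions, and the random data are all assumptions made for illustration.

```python
import numpy as np

def attention_pool(cell_embeddings, V, w):
    """Pool an (n_cells, d) matrix of cell embeddings into one d-vector.

    Hypothetical helper: V (d, hidden) and w (hidden,) stand in for
    parameters that would normally be learned during training.
    """
    # One unnormalized attention score per cell.
    scores = np.tanh(cell_embeddings @ V) @ w
    # Softmax over cells (shifted by the max for numerical stability).
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    # Weighted average of cell embeddings gives the donor-level vector.
    return weights @ cell_embeddings, weights

rng = np.random.default_rng(0)
d, hidden = 8, 4                      # assumed embedding sizes
V = rng.normal(size=(d, hidden))      # random stand-ins for learned weights
w = rng.normal(size=hidden)

# Two simulated donors with very different numbers of recovered cells.
donor_a = rng.normal(size=(50, d))
donor_b = rng.normal(size=(1200, d))

rep_a, weights_a = attention_pool(donor_a, V, w)
rep_b, weights_b = attention_pool(donor_b, V, w)
print(rep_a.shape, rep_b.shape)       # both donors map to a vector of length d
```

Because the pooled vector is a weighted average, its size does not depend on the number of cells and it is unchanged by reordering them. In a design of this kind, the per-cell weights could also be inspected to see which cells contribute most to a donor-level prediction.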
