Abstract: The increasing availability of large-scale single-cell datasets has enabled the detailed description of cell states across multiple biological conditions and perturbations. In parallel, recent advances in unsupervised machine learning, particularly in transfer learning, have enabled fast and scalable mapping of these new single-cell datasets onto reference atlases. The resulting large-scale machine learning models, however, often have millions of parameters, which makes the newly mapped datasets challenging to interpret. Here, we propose expiMap, a deep learning model that enables interpretable reference mapping using biologically understandable entities, such as curated sets of genes and gene programs. The key concept is the substitution of the uninterpretable nodes in an autoencoder's bottleneck by labeled nodes mapping to interpretable lists of genes, such as gene ontologies, biological pathways, or curated gene sets, for which activities are learned as constraints during reconstruction. This is enabled by incorporating predefined gene programs into the reference model while allowing the model to learn new programs de novo and to refine existing programs during reference mapping. We show that the model achieves integration performance similar to that of existing methods while providing a biologically interpretable framework for understanding cellular behavior. We demonstrate the capabilities of expiMap by applying it to 15 datasets encompassing five different tissues and species. The interpretable nature of the mapping revealed unreported associations between interferon signaling via the RIG-I/MDA5 and GPCR pathways, with differential behavior in CD8+ T cells and CD14+ monocytes in severe COVID-19, as well as the role of annexins in the cellular communications between lymphoid and myeloid compartments for explaining patient response to the applied drugs. Finally, expiMap enabled the direct comparison of a diverse set of pancreatic beta cells from multiple studies, where we observed a strong, previously unreported correlation between the unfolded protein response and asparagine N-linked glycosylation. Altogether, expiMap enables the interpretable mapping of single-cell transcriptome datasets across cohorts, disease states, and other perturbations.

### Competing Interest Statement

Fabian J. Theis consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd, and Omniscope Ltd, and has ownership interest in Dermagnostix GmbH and Cellarity.
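
The key concept stated in the abstract, bottleneck nodes that are each tied to a named list of genes, can be illustrated with a toy example. The sketch below assumes one plausible realization: a linear decoder whose weights are multiplied by a binary gene-program membership mask, so that each latent node can only influence the genes in its own program. The gene names, program names, weights, and the helper functions `build_mask` and `masked_decode` are all hypothetical and chosen for illustration; this is not the authors' implementation.

```python
# Toy sketch of a gene-program-masked linear decoder (illustrative only).
# All names and values below are hypothetical placeholders.

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]

# Each latent node is labeled with a gene program (a curated list of genes).
gene_programs = {
    "program_1": ["GENE_A", "GENE_B"],
    "program_2": ["GENE_C", "GENE_D"],
}


def build_mask(genes, gene_programs):
    """Return mask[p][g] = 1 if gene g is a member of program p, else 0."""
    mask = []
    for members in gene_programs.values():
        member_set = set(members)
        mask.append([1 if g in member_set else 0 for g in genes])
    return mask


def masked_decode(latent, weights, mask):
    """Linear decoder: latent node p contributes to gene g only if mask[p][g] == 1."""
    n_programs = len(mask)
    n_genes = len(mask[0])
    return [
        sum(latent[p] * weights[p][g] * mask[p][g] for p in range(n_programs))
        for g in range(n_genes)
    ]


mask = build_mask(genes, gene_programs)

# Arbitrary placeholder weights; in a trained model these would be learned.
weights = [[0.5] * len(genes) for _ in gene_programs]

# One cell's latent activities, one value per gene program.
latent = [2.0, -1.0]

reconstruction = masked_decode(latent, weights, mask)
for gene, value in zip(genes, reconstruction):
    print(f"{gene}: {value}")
```

Because of the mask, the first latent value only affects `GENE_A` and `GENE_B`, and the second only affects `GENE_C` and `GENE_D`, which is what lets each latent node be read as the activity of its labeled program. `GENE_E` belongs to no program here and therefore receives no contribution.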
