Conditionally Invariant Representation Learning for Disentangling Cellular Heterogeneity

Hananeh Aliee, Ferdinand Kapl, Soroor Hediyeh-Zadeh, Fabian J. Theis
2023-07-02
Abstract: This paper presents a novel approach that leverages domain variability to learn representations that are conditionally invariant to unwanted variability or distractors. Our approach identifies both spurious and invariant latent features necessary for achieving accurate reconstruction by placing distinct conditional priors on latent features. The invariant signals are disentangled from noise by enforcing independence which facilitates the construction of an interpretable model with a causal semantic. By exploiting the interplay between data domains and labels, our method simultaneously identifies invariant features and builds invariant predictors. We apply our method to grand biological challenges, such as data integration in single-cell genomics with the aim of capturing biological variations across datasets with many samples, obtained from different conditions or multiple laboratories. Our approach allows for the incorporation of specific biological mechanisms, including gene programs, disease states, or treatment conditions into the data integration process, bridging the gap between the theoretical assumptions and real biological applications. Specifically, the proposed approach helps to disentangle biological signals from data biases that are unrelated to the target task or the causal explanation of interest. Through extensive benchmarking using large-scale human hematopoiesis and human lung cancer data, we validate the superiority of our approach over existing methods and demonstrate that it can empower deeper insights into cellular heterogeneity and the identification of disease cell states.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
This paper addresses the problem of learning representations that remain invariant to unwanted variation or confounding factors, and it does so by exploiting the variability across data domains. The problem is particularly pressing in biology, above all in the integration and classification of single-cell genomics data. Such data often come from different experimental conditions or laboratories and therefore mix genuine biological variation with technical biases. Traditional methods often struggle to separate the relevant biological signal from this noise.

The main contributions of the paper are:

1. Reexamining the fundamental assumptions of invariant representation learning and pointing out that, in complex biological processes, independent and invariant causal mechanisms may not be sufficient to explain all phenomena.
2. Proposing an invariant representation learning method that identifies both spurious and invariant latent variables.
3. Showing that the proposed model is identifiable up to simple transformations and permutations of the latent variables.
4. Validating the method on single-cell data analysis, cell state identification, and cell type annotation, using large-scale single-cell RNA sequencing data from human hematopoiesis and human lung cancer.

The researchers construct a conditionally invariant deep generative model that integrates single-cell genomics data while preserving biological variation across datasets. The model can incorporate specific biological mechanisms, such as gene programs, disease states, or treatment conditions, into the data integration process, which helps to deepen the understanding of cellular heterogeneity and to identify disease cell states. In the reported benchmarks, the approach outperforms existing methods on single-cell data integration.
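
To make the idea of "distinct conditional priors on latent features" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a VAE-style model with diagonal Gaussian posteriors, in which the invariant block of the latent vector has a prior conditioned on a biological label and the spurious block has a prior conditioned on the data domain (for example, the batch or laboratory). The function names, the lookup-table priors, and the choice of conditioning variables are all hypothetical and serve only as an illustration.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)


def conditional_prior_kl(mu_q, logvar_q, label, domain, prior_inv, prior_spur, n_inv):
    """KL term with separate conditional priors for the two latent blocks.

    Assumption (illustrative only): the first n_inv latent dimensions are the
    invariant block with prior p(z_inv | label); the remaining dimensions are
    the spurious block with prior p(z_spur | domain). Priors are stored as
    hypothetical lookup tables of means and log-variances.
    """
    # Split the posterior parameters into invariant and spurious blocks.
    mu_i, mu_s = mu_q[:, :n_inv], mu_q[:, n_inv:]
    lv_i, lv_s = logvar_q[:, :n_inv], logvar_q[:, n_inv:]

    # Look up the prior parameters for each cell's label and domain.
    p_mu_i, p_lv_i = prior_inv["mu"][label], prior_inv["logvar"][label]
    p_mu_s, p_lv_s = prior_spur["mu"][domain], prior_spur["logvar"][domain]

    # The blocks are treated as independent, so their KL terms add up.
    return gaussian_kl(mu_i, lv_i, p_mu_i, p_lv_i) + gaussian_kl(mu_s, lv_s, p_mu_s, p_lv_s)


# Toy setup: 2 labels, 3 domains, 2 invariant + 1 spurious latent dimensions.
rng = np.random.default_rng(0)
n_inv, n_spur = 2, 1
prior_inv = {"mu": rng.normal(size=(2, n_inv)), "logvar": np.zeros((2, n_inv))}
prior_spur = {"mu": rng.normal(size=(3, n_spur)), "logvar": np.zeros((3, n_spur))}

label = np.array([0, 1, 1, 0])    # hypothetical biological label per cell
domain = np.array([0, 2, 1, 2])   # hypothetical domain (batch) per cell

# A posterior that matches both conditional priors exactly.
mu_q = np.concatenate([prior_inv["mu"][label], prior_spur["mu"][domain]], axis=1)
logvar_q = np.zeros_like(mu_q)

kl_matched = conditional_prior_kl(mu_q, logvar_q, label, domain, prior_inv, prior_spur, n_inv)

# Shift only the spurious block away from its domain-conditioned prior.
mu_shifted = mu_q.copy()
mu_shifted[:, n_inv:] += 1.0
kl_shifted = conditional_prior_kl(mu_shifted, logvar_q, label, domain, prior_inv, prior_spur, n_inv)

print(kl_matched, kl_shifted)
```

In this toy setup the KL term vanishes when the posterior matches both conditional priors and grows when the spurious block drifts from its domain-conditioned prior. A full model would add a reconstruction term and learn the prior parameters, for instance with small neural networks, rather than fixing them in lookup tables.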