inVAE: Conditionally invariant representation learning for generating multivariate single-cell reference maps

Hananeh Aliee, Ferdinand Kapl, Duy Pham, Batuhan Cakir, Takahiro Jimba, James Cranley, Sarah A. Teichmann, Kerstin B. Meyer, Roser Vento-Tormo, Fabian J. Theis
DOI: https://doi.org/10.1101/2024.12.06.627196
2024-12-12
Abstract: Single-cell data is driving new insights into the spatiotemporal dynamics of cells and individual disease susceptibility. However, accurately identifying cell states across diverse cohorts remains challenging, as both biological variation and technical biases cause distributional shifts in the data. Separating these effects is crucial for capturing cellular heterogeneity and ensuring interpretability. To address this, we developed inVAE, a conditionally invariant deep generative model based on variational autoencoders. inVAE models the latent space as a combination of invariant variables, encoding true biological signals, and spurious variables, capturing technical biases. By conditioning the prior distribution of cells on biological covariates, such as disease variants, inVAE identifies high-resolution cell states in the invariant representation. Enforcing independence between the two representations disentangles biological signals from noise, enabling a more interpretable and generalizable model with causal semantics. inVAE outperformed existing methods across four cellular atlases of the human heart and lung, while uncovering novel cell states. It precisely stratified cell atlas donors based on the genetic impact of pathogenic variants, and excelled in predicting cell types and disease in unseen data, proving its generalizability as a reference model for label transfer. Furthermore, inVAE accurately identified temporal cell states and trajectories from developmental datasets, and captured spatial cell states in a spatially resolved atlas. In summary, inVAE provides a powerful method for integrating multivariate single-cell transcriptomics data. By leveraging prior knowledge such as metadata, it effectively accounts for biological variation and improves latent space interpretability by disentangling biological and technical sources of variation. These capabilities enable deeper insights into cellular heterogeneity and its role in disease progression.
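The split latent space described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear covariate-to-prior maps, the latent dimensions, the random stand-in encoder outputs, and the choice to condition the spurious block on a batch label are all assumptions made for illustration. It only shows how a KL regulariser might be computed when the prior of each latent block is a diagonal Gaussian conditioned on covariates; the independence constraint between the two blocks is not shown.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

def conditional_prior(covariates, weights_mu, weights_logvar):
    """Hypothetical linear map from one-hot covariates to prior mean and log-variance."""
    return covariates @ weights_mu, covariates @ weights_logvar

rng = np.random.default_rng(0)
n_cells, d_inv, d_spur = 5, 3, 2   # assumed sizes, chosen for illustration

# One-hot covariates: a biological label (e.g. disease status) and a technical label (e.g. batch).
bio = np.eye(2)[rng.integers(0, 2, n_cells)]
tech = np.eye(3)[rng.integers(0, 3, n_cells)]

# Random stand-ins for encoder outputs over the concatenated latent [z_inv, z_spur].
mu_q = rng.normal(size=(n_cells, d_inv + d_spur))
logvar_q = rng.normal(scale=0.1, size=(n_cells, d_inv + d_spur))

# Invariant block: prior conditioned on biological covariates.
mu_inv, lv_inv = conditional_prior(bio, rng.normal(size=(2, d_inv)), np.zeros((2, d_inv)))
# Spurious block: prior conditioned on technical covariates (an assumption of this sketch).
mu_spur, lv_spur = conditional_prior(tech, rng.normal(size=(3, d_spur)), np.zeros((3, d_spur)))

kl_inv = gaussian_kl(mu_q[:, :d_inv], logvar_q[:, :d_inv], mu_inv, lv_inv)
kl_spur = gaussian_kl(mu_q[:, d_inv:], logvar_q[:, d_inv:], mu_spur, lv_spur)
kl_total = kl_inv + kl_spur   # per-cell KL term that would enter an ELBO-style loss
```

In a trained model the encoder outputs and prior weights would be learned jointly with a decoder; here they are random so the block runs on its own.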
Bioinformatics