Semi-supervised integration of single-cell transcriptomics data

Massimo Andreatta, Léonard Hérault, Paul Gueguen, David Gfeller, Ariel J. Berenstein, Santiago J. Carmona
DOI: https://doi.org/10.1038/s41467-024-45240-z
IF: 16.6
2024-01-29
Nature Communications
Abstract: Batch effects in single-cell RNA-seq data pose a significant challenge for comparative analyses across samples, individuals, and conditions. Although batch effect correction methods are routinely applied, data integration often leads to overcorrection and can result in the loss of biological variability. In this work we present STACAS, a batch correction method for scRNA-seq that leverages prior knowledge on cell types to preserve biological variability upon integration. Through an open-source benchmark, we show that semi-supervised STACAS outperforms state-of-the-art unsupervised methods, as well as supervised methods such as scANVI and scGen. STACAS scales well to large datasets and is robust to incomplete and imprecise input cell type labels, which are commonly encountered in real-life integration tasks. We argue that the incorporation of prior cell type information should be a common practice in single-cell data integration, and we provide a flexible framework for semi-supervised batch effect correction.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper addresses batch effects in single-cell RNA sequencing (scRNA-seq) data, which arise when comparing data from different samples, individuals, or conditions. Batch effects can mask true biological differences, so datasets need to be corrected before they can be integrated and compared. The paper presents STACAS (Sub-Type Anchor Correction for Alignment in Seurat), a semi-supervised scRNA-seq integration method that uses prior knowledge of cell types to preserve biological variability during integration. In the paper's benchmark, semi-supervised STACAS outperforms state-of-the-art unsupervised methods as well as supervised methods such as scANVI and scGen. Specifically, the paper covers the following points:

1. **Batch Effect Correction**: The paper proposes a semi-supervised mode of STACAS that reduces batch effects while retaining as much biological signal as possible.
2. **Preservation of Biological Variability**: STACAS uses known cell type information to guide integration, avoiding the overcorrection that leads to loss of biological variability.
3. **Performance Evaluation**: The paper introduces an improved metric, CiLISI (a cell type-aware integration LISI, computed per cell type), to assess integration quality while accounting for biological variability.
4. **Robustness Testing**: The paper shows that STACAS is robust to incomplete and imprecise input cell type labels, which are common in real-life integration tasks.
5. **Application Scenarios**: The paper concludes with a case study that integrates single-cell transcriptomes of human CD8+ T cells from multiple studies into a multi-study reference map, illustrating the practical value of STACAS.
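To make the idea behind a per-cell-type integration LISI more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names `inverse_simpson` and `per_celltype_ilisi` are hypothetical, neighbors are found by brute-force k-nearest-neighbor search with uniform weights (the original LISI uses perplexity-based weighting), and the rescaling to the 0 to 1 range is an assumption made for readability. The score is the inverse Simpson's index of batch labels in each cell's neighborhood, computed only among cells of the same type.

```python
import numpy as np


def inverse_simpson(labels):
    """Inverse Simpson's index: the effective number of distinct labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 / np.sum(p ** 2)


def per_celltype_ilisi(X, batch, celltype, k=10):
    """Simplified per-cell-type batch mixing score (illustrative only).

    X        : (n_cells, n_dims) integrated embedding, e.g. PCA coordinates
    batch    : batch label per cell
    celltype : cell type label per cell
    Returns a dict mapping cell type to a mean score in [0, 1],
    where 0 means no batch mixing and 1 means ideal mixing.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    celltype = np.asarray(celltype)
    scores = {}
    for ct in np.unique(celltype):
        idx = np.where(celltype == ct)[0]
        n_batches = len(np.unique(batch[idx]))
        # Mixing is undefined with a single batch or too few cells.
        if n_batches < 2 or len(idx) <= k:
            continue
        sub = X[idx]
        # Brute-force pairwise Euclidean distances within this cell type.
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)  # a cell is not its own neighbor
        nn = np.argsort(d, axis=1)[:, :k]
        sub_batch = batch[idx]
        lisi = np.array([inverse_simpson(sub_batch[row]) for row in nn])
        # Assumed normalization: map [1, n_batches] onto [0, 1].
        scores[ct] = float(np.mean((lisi - 1.0) / (n_batches - 1.0)))
    return scores
```

Restricting neighborhoods to one cell type at a time means a method is not rewarded for mixing batches by merging distinct cell types, which is the overcorrection scenario the summary describes.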
In summary, the paper presents semi-supervised STACAS as an effective way to handle batch effects in single-cell transcriptomics data, and supports this with an open-source benchmark against unsupervised and supervised methods and with the CD8+ T cell case study.