Contrastive Learning for Robust Cell Annotation and Representation from Single-Cell Transcriptomics

Leo Andrekson, Rocío Mercado
DOI: https://doi.org/10.1101/2024.06.20.599868
2024-06-24
Abstract: Batch effects are a significant concern in single-cell RNA sequencing (scRNA-Seq) data analysis, where variations in the data can be attributed to factors unrelated to cell types. This can make downstream analysis a challenging task. In this study, we present a novel deep learning approach using contrastive learning and a carefully designed loss function for learning a generalizable embedding space from scRNA-Seq data. We call this model CELLULAR: CELLUlar contrastive Learning for Annotation and Representation. When benchmarked against multiple established methods for scRNA-Seq integration, CELLULAR outperforms existing methods in learning a generalizable embedding space on multiple datasets. Cell annotation was also explored as a downstream application for the learned embedding space. When compared against multiple well-established methods, CELLULAR demonstrates competitive performance with top cell classification methods in terms of accuracy, balanced accuracy, and F1 score. CELLULAR is also capable of performing novel cell type detection. These findings aim to quantify the quality of the embedding space learned by the model by highlighting the robust performance of our learned cell representations in various applications. The model has been structured into an open-source Python package, specifically designed to simplify and streamline its usage for bioinformaticians and other scientists interested in cell representation learning.
Bioinformatics
What problem does this paper attempt to address?
The main problem this paper attempts to solve is batch effects in single-cell RNA sequencing (scRNA-Seq) data analysis: variation in the data may be caused by factors unrelated to cell type, which makes downstream analysis difficult. Specifically, the paper proposes a new method named CELLULAR, which aims to learn a generalizable embedding space from scRNA-Seq data through contrastive learning and a carefully designed loss function. The method is intended both to reduce the influence of batch effects and to support accurate cell-type annotation and novel cell type detection.

### Main research objectives:

1. **Reduce batch effects**: Develop a method that effectively reduces batch effects in scRNA-Seq data, so that data from different sources can be better integrated in the embedding space.
2. **Cell-type annotation**: Use the learned embedding space for cell-type annotation and compare it with existing methods to verify its performance on cell classification tasks.
3. **Novel cell type detection**: Explore the use of the learned embedding space for detecting novel cell types and evaluate how well it identifies unknown cell types.

### Method overview:

- **Contrastive learning**: Through contrastive learning, the model learns the similarities and differences between cell types, so that cell types are better represented in the embedding space.
- **Loss function**: A composite loss function that combines a contrastive loss and a cell-type centroid loss is designed to ensure the generalization ability and biological significance of the embedding space.
- **Datasets**: Multiple existing scRNA-Seq datasets are used for benchmarking, including bone marrow, PBMC, kidney, pancreas, and combined datasets.
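
To make the idea of a composite loss more concrete, the following is a minimal pure-Python sketch, not the paper's actual formulation. It assumes a supervised-contrastive-style term over cosine similarities and a centroid term that simply pulls each embedding toward the mean of its cell type. The function names, the `temperature` and `centroid_weight` parameters, and the exact form of both terms are illustrative assumptions.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, labels, temperature=0.5):
    """Assumed supervised-contrastive-style term: cells sharing a label are
    positives, all other cells in the batch act as negatives."""
    n = len(embeddings)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-type partner contributes nothing
        logits = {
            j: cosine_similarity(embeddings[i], embeddings[j]) / temperature
            for j in range(n) if j != i
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking a positive for this anchor.
        total -= sum(logits[j] - log_denominator for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


def centroid_loss(embeddings, labels):
    """Assumed centroid term: mean squared distance of each cell embedding
    to the centroid (mean embedding) of its own cell type."""
    groups = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        groups[label].append(emb)
    centroids = {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in groups.items()
    }
    squared = [
        sum((a - c) ** 2 for a, c in zip(emb, centroids[label]))
        for emb, label in zip(embeddings, labels)
    ]
    return sum(squared) / len(squared)


def composite_loss(embeddings, labels, centroid_weight=1.0, temperature=0.5):
    """Weighted sum of the two terms; the weighting scheme is hypothetical."""
    return (contrastive_loss(embeddings, labels, temperature)
            + centroid_weight * centroid_loss(embeddings, labels))
```

Under these assumptions, a batch in which cells of the same type sit close together yields a lower loss than the same embeddings with shuffled labels, which is the behavior a training loop would exploit. In practice such a loss would be written in a deep learning framework with automatic differentiation rather than plain Python.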
### Main results:

- **Reduce batch effects**: In UMAP visualizations and scIB benchmarks, CELLULAR performs well at reducing batch effects, clustering cells of the same type together without being affected by non-biological factors such as patient ID.
- **Cell-type annotation**: On multiple datasets, CELLULAR performs better than or comparably to top existing methods on cell-type annotation tasks in terms of accuracy, balanced accuracy, and F1 score.
- **Novel cell type detection**: CELLULAR can flag novel cell types by applying likelihood thresholds, demonstrating its potential for detecting unknown cell types.

### Conclusion:

Through contrastive learning and the composite loss function, CELLULAR reduces batch effects in scRNA-Seq data and performs well on cell-type annotation and novel cell type detection tasks. These results indicate that CELLULAR is a powerful tool that can provide more accurate and generalizable cell representations in single-cell transcriptomics research.
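
The likelihood-threshold idea behind novel cell type detection can be illustrated with a short sketch. It assumes a classifier has already produced per-cell probabilities over the known cell types; any cell whose highest probability falls below a chosen threshold is flagged instead of being assigned a known type. The function name, the default threshold of 0.5, and the `"Unknown"` label are hypothetical and are not taken from the CELLULAR package.

```python
def annotate_with_novelty(probabilities, cell_types, threshold=0.5,
                          novel_label="Unknown"):
    """Assign each cell its most likely known type, or `novel_label` when the
    highest class probability is below `threshold`.

    probabilities: one list of per-class probabilities per cell
    cell_types:    names of the known cell types, in the same class order
    """
    annotations = []
    for probs in probabilities:
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            annotations.append(cell_types[best])
        else:
            annotations.append(novel_label)  # low confidence: possible novel type
    return annotations


# Illustrative values only: the second cell has no confident assignment.
predicted = annotate_with_novelty(
    [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]],
    ["T cell", "B cell", "Monocyte"],
)
```

The threshold trades off the two kinds of error: a higher value flags more cells as potentially novel at the cost of rejecting some correctly classified cells, so it would normally be tuned on held-out data.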