Abstract: Over the past decade, single-cell transcriptomic technologies have advanced remarkably, enabling the simultaneous profiling of gene expression across thousands of individual cells. Cell type identification plays an essential role in exploring tissue heterogeneity and characterizing differences in cell state. As more well-annotated reference data become available, numerous automatic identification methods have emerged that simplify the annotation of unlabeled target data by transferring cell type knowledge from the reference. In practice, however, the target data often include novel cell types that are absent from the reference data. Most existing methods assign these private cells to a single generic 'unassigned' group and learn the features of known and novel cell types in a coupled way. As a result, they are susceptible to batch effects and fail to explore the fine-grained semantic knowledge of novel cell types, which hurts the model's discrimination ability. Additionally, emerging spatial transcriptomic technologies, such as in situ hybridization, sequencing and multiplexed imaging, present a new challenge to current cell type identification strategies, which predominantly neglect spatial organization. It is therefore imperative to develop a versatile method that can proficiently annotate both spatial and non-spatial single-cell transcriptomics data. To address these issues, we propose a new, challenging yet realistic task called universal cell type identification for single-cell and spatial transcriptomics data. In this task, we aim to assign semantic labels to target cells from known cell types and cluster labels to those from novel ones. To tackle this problem, instead of designing a suboptimal two-stage approach, we propose an end-to-end algorithm called scBOL from the perspective of Bipartite prototype alignment.
First, we identify the mutual nearest clusters in the reference and target data as their potential common cell types. On this basis, we mine cycle-consistent semantic anchor cells to build the intrinsic structural association between the two datasets. Second, we design a neighbor-aware prototypical learning paradigm to strengthen inter-cluster separability and intra-cluster compactness within each dataset, thereby encouraging discriminative feature representations. Third, driven by the semantic-aware prototypical learning framework, we align the known cell types and separate the private cell types from them across the reference and target data. The algorithm can be applied seamlessly to various data types modeled by different foundation models that generate embedding features for cells. Specifically, for non-spatial single-cell transcriptomics data, we use an autoencoder neural network to learn latent low-dimensional cell representations, and for spatial single-cell transcriptomics data, we apply a graph convolutional network to jointly capture the molecular and spatial similarities of cells. Extensive results on our carefully designed evaluation benchmarks demonstrate the superiority of scBOL over various state-of-the-art cell type identification methods. To our knowledge, we are the first to present this pragmatic annotation task and to devise a comprehensive algorithmic framework for resolving it across varied types of single-cell data. Finally, scBOL is implemented in Python using the PyTorch machine-learning library and is freely available at https://github.com/aimeeyaoyao/scBOL.
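
The first step, matching mutual nearest clusters between reference and target data, can be illustrated with a minimal sketch. This is not the scBOL implementation: the function names, the use of cosine similarity, and the toy two-dimensional prototypes (cluster centroids in an embedding space) are all assumptions made for illustration. A reference cluster and a target cluster are paired only when each is the other's nearest neighbor; target clusters left unpaired are candidates for novel cell types.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two vectors (assumed metric, not from the paper).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_index(query, candidates):
    # Index of the candidate prototype most similar to the query.
    sims = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: sims[i])

def mutual_nearest_clusters(ref_prototypes, tgt_prototypes):
    # Keep (ref, tgt) pairs only when each is the other's nearest prototype.
    ref_to_tgt = [nearest_index(r, tgt_prototypes) for r in ref_prototypes]
    tgt_to_ref = [nearest_index(t, ref_prototypes) for t in tgt_prototypes]
    return [(i, j) for i, j in enumerate(ref_to_tgt) if tgt_to_ref[j] == i]

# Hypothetical toy prototypes: two reference clusters, three target clusters.
ref_prototypes = [[1.0, 0.0], [0.0, 1.0]]
tgt_prototypes = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]

pairs = mutual_nearest_clusters(ref_prototypes, tgt_prototypes)

# Target clusters without a mutual match are candidates for novel cell types.
matched_targets = {j for _, j in pairs}
novel_candidates = [j for j in range(len(tgt_prototypes)) if j not in matched_targets]

print(pairs)
print(novel_candidates)
```

In this toy example the third target prototype has no mutual partner, so it would be treated as a potential private cell type and given a cluster label rather than a semantic one.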

scBOL: a universal cell type identification framework for single-cell and spatial transcriptomics data
