Abstract: The rapid advancement of single-cell sequencing technology has significantly deepened our understanding of cellular heterogeneity, yet it concurrently presents substantial challenges for the unified modeling of single-cell data. Simultaneously, pre-trained foundation models have achieved notable success in domains such as natural language processing and image analysis. However, extending these models to accommodate ultra-long single-cell transcriptome sequences, characterized by an extensive number of genes, remains a formidable task. In this study, we introduce SC-MAMBA2, based on the MAMBA2 architecture, meticulously designed with a bidirectional modeling approach tailored for single-cell transcriptomic data. As the first single-cell foundation model to integrate the state-space models (SSMs) underlying the MAMBA2 architecture, SC-MAMBA2 features over 625 million parameters, covers more than 60,000 genes, and was pre-trained on a dataset of over 57 million cells, making it the most comprehensive solution for processing ultra-long transcriptome sequences. Extensive benchmarking across a diverse array of downstream tasks consistently demonstrates that SC-MAMBA2 surpasses state-of-the-art models, delivering superior accuracy and enhanced computational efficiency.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is the efficient modeling of ultra-long single-cell transcriptome data. Specifically, researchers face the following challenges:

1. **Complexity of single-cell data**: Rapid progress in single-cell sequencing technology has given researchers a deeper view of cellular heterogeneity, but it also makes unified modeling of single-cell data much harder. In particular, single-cell data usually covers a large number of genes, and processing these ultra-long sequences efficiently is difficult.
2. **Limitations of existing models**: Although pre-trained foundation models have achieved remarkable success in fields such as natural language processing and image analysis, extending them to ultra-long single-cell transcriptome sequences remains difficult. Existing single-cell foundation models rely mainly on masked learning and have not fully exploited the potential of generative models. In addition, the cost of the attention mechanism grows quadratically with sequence length, so training and inference consume substantial resources and the models are hard to scale to larger datasets.

To address these problems, the researchers propose **SC-MAMBA2**, a generative foundation model based on state-space models (SSMs) and designed specifically for efficient modeling of single-cell transcriptome data. Its main features are:

- **Innovative architecture**: SC-MAMBA2 is the first single-cell foundation model to integrate the state-space models underlying the MAMBA2 architecture. It processes large-scale gene sequences efficiently and scalably, overcoming the computational efficiency limits that traditional Transformer architectures face on large-scale biological data.
- **Long-sequence modeling**: Through design modifications and a bidirectional modeling approach, SC-MAMBA2 can process full-length gene sequences containing more than 60,530 genes, the longest sequence currently processed in single-cell transcriptomics. This lets SC-MAMBA2 analyze the full transcriptome and capture complex biological variation and regulatory elements.
- **Powerful performance**: Extensive benchmarks show that SC-MAMBA2 outperforms existing state-of-the-art models on multiple downstream tasks, such as gene expression quantification, cell-type classification, and trajectory inference. SC-MAMBA2 captures the full picture of transcriptome information while remaining computationally efficient, which promotes wider use of generative foundation models in transcriptomics.

In summary, through its architecture and efficient modeling capabilities, SC-MAMBA2 addresses key problems in modeling ultra-long single-cell transcriptome data and lays a foundation for more comprehensive and efficient single-cell data analysis.
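To make the contrast with quadratic attention concrete, here is a minimal sketch of a bidirectional state-space scan in NumPy. It is not SC-MAMBA2's implementation: the diagonal recurrence, the function names `ssm_scan` and `bidirectional_ssm`, and the choice to sum the forward and backward outputs are all assumptions made for illustration. The point is only that each position costs one fixed-size state update, so the work grows linearly with the number of genes, and that a second scan over the reversed sequence lets every position see context from both sides.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear state-space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c . h_t
    x: (L,) input sequence; a, b, c: (N,) per-state parameters.
    """
    h = np.zeros_like(a, dtype=float)
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        # One O(N) state update per position, so the scan is O(L * N) overall.
        h = a * h + b * x_t
        y[t] = c @ h
    return y

def bidirectional_ssm(x, a, b, c):
    """Combine a forward scan with a scan over the reversed sequence.

    Summing the two directions is an illustrative assumption, not the
    combination rule used by SC-MAMBA2.
    """
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return forward + backward

if __name__ == "__main__":
    # Toy "expression" sequence with a single impulse at the first position.
    x = np.array([1.0, 0.0, 0.0, 0.0])
    a = np.array([0.5])  # state decay
    b = np.array([1.0])  # input projection
    c = np.array([1.0])  # output projection
    print(bidirectional_ssm(x, a, b, c))
```

Because genes in a cell have no natural left-to-right order, a one-directional scan would let each gene see only the genes placed before it. The reversed pass is the simplest way to remove that asymmetry while keeping the linear cost.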

SC-MAMBA2: Leveraging State-Space Models for Efficient Single-Cell Ultra-Long Transcriptome Modeling

scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics

CELLama: Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities

scMMT: a multi-use deep learning approach for cell annotation, protein prediction and embedding in single-cell RNA-seq data

scTab: Scaling Cross-Tissue Single-Cell Annotation Models

CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

BarcodeMamba: State Space Models for Biodiversity Analysis

scReader: Prompting Large Language Models to Interpret scRNA-seq Data

scAMACE: Model-based approach to the joint analysis of single-cell data on chromatin accessibility, gene expression and methylation

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

BioLLM: A Standardized Framework for Integrating and Benchmarking Single-Cell Foundation Models

scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis

SCEMENT: Scalable and Memory Efficient Integration of Large-scale Single Cell RNA-sequencing Data

scME: A Dual-Modality Factor Model for Single-Cell Multiomics Embedding

Bi-Mamba: Towards Accurate 1-Bit State Space Models

Large-scale foundation model on single-cell transcriptomics

A deep generative model for multi-view profiling of single-cell RNA-seq and ATAC-seq data

scEMB: Learning context representation of genes based on large-scale single-cell transcriptomics

SurvMamba: State Space Model with Multi-grained Multi-modal Interaction for Survival Prediction

MAMBA: a model-driven, constraint-based multiomic integration method