Abstract: Single-cell RNA sequencing (scRNA-seq) data is a potent tool for comprehending the "language of life" and can provide insights into various downstream biomedical tasks. Large-scale language models (LLMs) are starting to be used for cell representation learning. However, current LLM-based cell representation learning methods depend solely on the BERT architecture, which yields an anisotropic embedding space and, in turn, inefficient semantic representation. Contrastive learning alleviates this problem by distributing the embeddings uniformly. Because a larger batch size in contrastive learning results in better representations, its practical application to cell representation learning is hampered by the high dimensionality of scRNA-seq data and the large parameter count of LLMs, both of which limit the batch size that fits in GPU memory. To address the batch size limitation, we propose a novel divide-and-conquer contrastive learning approach that decouples the batch size from the GPU memory size for cell representation learning. Based on this approach, we introduce the Single-Cell Language Model (CellLM), a large-scale cell representation learning model that handles high-dimensional scRNA-seq data with tens of thousands of genes. CellLM has over 50 million parameters, is trained on 2 million scRNA-seq data points, and makes the first attempt to learn cell language models from both normal cells and cancer cells. CellLM achieves new state-of-the-art (SOTA) results in all evaluated downstream tasks, including an F1-score of 71.8 for cell type annotation (a 3.0% absolute improvement over scBERT), an average F1-score of 88.9 for single-cell drug sensitivity prediction in a few-shot scenario (an 8.3% absolute improvement), and a Pearson's correlation of 93.4 for single-omics cell line drug sensitivity prediction (a 6.2% absolute improvement).
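To make the batch-size issue concrete, here is a minimal sketch of a contrastive objective in plain Python. The abstract does not specify the loss, so the InfoNCE form, the temperature value, and the helper names (`normalize`, `info_nce_rows`, `info_nce`) are all assumptions made for illustration. The sketch shows only that the loss over a large batch is a sum of per-anchor terms and can therefore be evaluated in chunks. It is not the authors' divide-and-conquer scheme, which must also handle gradients through the encoder.

```python
import math
import random

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def info_nce_rows(anchors, positives, rows, temperature=0.1):
    # Sum of per-anchor InfoNCE terms for the given row indices.
    # Every other item in `positives` acts as a negative for anchor i,
    # so a larger batch supplies more negatives per anchor.
    total = 0.0
    for i in rows:
        logits = [
            sum(a * b for a, b in zip(anchors[i], p)) / temperature
            for p in positives
        ]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total

def info_nce(anchors, positives, temperature=0.1, chunk_size=None):
    # Mean InfoNCE loss, optionally accumulated over chunks of anchors.
    n = len(anchors)
    chunk_size = chunk_size or n
    total = 0.0
    for start in range(0, n, chunk_size):
        rows = range(start, min(start + chunk_size, n))
        total += info_nce_rows(anchors, positives, rows, temperature)
    return total / n

# Toy data (hypothetical): 16 "cells" with 8-dimensional embeddings,
# plus a lightly perturbed second view of each one.
random.seed(0)
views_a = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(16)]
views_b = [normalize([x + 0.1 * random.gauss(0, 1) for x in v]) for v in views_a]

full = info_nce(views_a, views_b)
chunked = info_nce(views_a, views_b, chunk_size=4)
mismatched = info_nce(views_a, views_b[::-1])
```

In this toy setup, `full` and `chunked` agree up to floating-point error, and pairing each anchor with the wrong view (`mismatched`) gives a higher loss. The harder part, which this sketch leaves out, is keeping memory bounded during backpropagation as well as during the forward pass.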

scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis

scReader: Prompting Large Language Models to Interpret scRNA-seq Data

scInterpreter: Training Large Language Models to Interpret scRNA-seq Data for Cell Type Annotation

BioLLM: A Standardized Framework for Integrating and Benchmarking Single-Cell Foundation Models

scMulan: a Multitask Generative Pre-Trained Language Model for Single-Cell Analysis

CELLama: Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities

Evaluating the Utilities of Foundation Models in Single-cell Data Analysis

Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning

scMMT: a multi-use deep learning approach for cell annotation, protein prediction and embedding in single-cell RNA-seq data

CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

scEMB: Learning context representation of genes based on large-scale single-cell transcriptomics

GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling

Parameter-Efficient Fine-Tuning Enhances Adaptation of Single Cell Large Language Model for Cell Type Identification

scKEPLM: Knowledge enhanced large-scale pre-trained language model for single-cell transcriptomics

Large-scale foundation model on single-cell transcriptomics

scMoE: Single-Cell Multi-Modal Multi-Task Learning Via Sparse Mixture-of-Experts

ChatCell: Facilitating Single-Cell Analysis with Natural Language

Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data

scMoE: single-cell mixture of experts for learning hierarchical, cell-type-specific, and interpretable representations from heterogeneous scRNA-seq data

scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics

scMODAL: A general deep learning framework for comprehensive single-cell multi-omics data alignment with feature links