CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells
Yuansong Zeng, Jiancong Xie, Zhuoyi Wei, Yun Su, Ningyuan Shangguan, Shuangyu Yang, Chengyang Zhang, Wenbing Li, Jinbo Zhang, Nan Fang, Hongyu Zhang, Huiying Zhao, Yutong Lu, Jue Fan, Weijiang Yu, Yuedong Yang
DOI: https://doi.org/10.1101/2024.06.04.597369
2024-06-06
Abstract: The rapid evolution of single-cell sequencing technologies has enabled precise transcriptomic profiling at the single-cell level, shedding light on the intricate heterogeneity within cellular populations. Despite these advances, the inherent diversity of cells and data challenges such as noise, batch effects, and sparsity underscore the pressing need for a unified model that can learn and represent cellular states effectively. Single-cell Large Language Models (LLMs) have been developed to bridge this gap, yet they exhibit limited performance on human cells. This shortfall may stem from the confounding effects of training on data from diverse species, which is done partly because too few cells are available for any single species. Here, we compiled a dataset of approximately 100 million human cells, sequenced with multiple technologies, from human single-cell datasets of various file types deposited in public databases and websites. Leveraging these extensive data, we developed CellFM, a robust single-cell foundation model with 800 million parameters, an eight-fold increase over the current largest single-species model. To enable the training of CellFM on the MindSpore AI framework from Huawei, we adopted RetNet, a Transformer architecture variant with linear complexity that balances efficiency and performance, as the backbone of our model. Our comprehensive experiments show that CellFM outperforms existing models across diverse applications, such as cell annotation, perturbation prediction, and gene function prediction.
Bioinformatics