Scaling Dense Representations for Single Cell with Transcriptome-Scale Context

Nicholas Ho, Caleb N. Ellington, Jinyu Hou, Sohan Addagudi, Shentong Mo, Tianhua Tao, Dian Li, Yonghao Zhuang, Hongyi Wang, Xingyi Cheng, Le Song, Eric P. Xing
DOI: https://doi.org/10.1101/2024.11.28.625303
2024-12-03
Abstract: Developing a unified model of cellular systems is a canonical challenge in biology. Recently, a wealth of public single-cell RNA sequencing data and the rapid scaling of self-supervised learning methods have provided new avenues to address this longstanding challenge. However, although rapid parameter scaling has been essential to the success of large language models in text and images, similar scaling has not been attempted with Transformer architectures for cellular modeling. To produce accurate, transferable, and biologically meaningful representations of cellular systems, we develop AIDO.Cell, a pretrained module for representing gene expression and cellular systems in an AI-driven Digital Organism. AIDO.Cell contains a series of 3M, 10M, 100M, and 650M parameter encoder-only dense Transformer models pre-trained on 50 million human cells from diverse tissues using a read-depth-aware masked gene expression pretraining objective. Unlike previous models, AIDO.Cell can take the entire human transcriptome as input without truncation or sampling tricks, and so learns accurate and general representations of the human cell's entire transcriptional context. Pretraining with this longer context was enabled by FlashAttention-2, mixed precision, and large-scale distributed systems training. AIDO.Cell (100M) achieves state-of-the-art results in tasks such as zero-shot clustering, cell-type classification, and perturbation modeling. Our findings reveal interesting loss scaling behaviors as we increase AIDO.Cell's parameters from 3M to 650M, providing insights for future directions in single-cell modeling. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.
Bioinformatics
What problem does this paper attempt to address?