scGPT: toward building a foundation model for single-cell multi-omics using generative AI

Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, Bo Wang
DOI: https://doi.org/10.1038/s41592-024-02201-0
IF: 48
2024-02-28
Nature Methods
Abstract: Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
biochemical research methods
What problem does this paper attempt to address?
The paper addresses several key tasks in single-cell multi-omics research:

1. **Cell Type Annotation**: Annotating cell types with high precision using the pretrained scGPT model.
2. **Genetic Perturbation Prediction**: Predicting responses to unseen gene perturbations, thereby extending the scope of perturbation experiments.
3. **Batch Correction and Multi-omics Integration**: Integrating data from different batches and jointly analyzing multiple omics modalities.
4. **Gene Network Inference**: Inferring interactions between genes from learned gene expression patterns.

The core contribution is scGPT, a single-cell foundation model built on the generative pretrained transformer architecture, together with a demonstration of its strong performance on the tasks above. Through pretraining on large-scale single-cell sequencing data (over 33 million cells), scGPT extracts biological information that transfers to these downstream tasks.
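
The abstract's analogy (texts comprise words; cells are defined by genes) can be made concrete with a toy sketch. The code below is only an illustration of treating genes as tokens, not scGPT's actual preprocessing: the function names `build_gene_vocab` and `cell_to_tokens`, the toy expression values, and the choice to drop unexpressed genes are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple


def build_gene_vocab(gene_names: Iterable[str]) -> Dict[str, int]:
    """Assign each gene an integer token id, as a text tokenizer does for words.

    Hypothetical helper: ids follow alphabetical order of the gene names.
    """
    return {gene: idx for idx, gene in enumerate(sorted(set(gene_names)))}


def cell_to_tokens(
    expression: Dict[str, float], vocab: Dict[str, int]
) -> List[Tuple[int, float]]:
    """Represent one cell as (gene_id, expression_value) pairs.

    Assumption for this sketch: only genes with nonzero expression that
    appear in the vocabulary are kept, sorted by gene id.
    """
    pairs = [
        (vocab[gene], value)
        for gene, value in expression.items()
        if value > 0 and gene in vocab
    ]
    return sorted(pairs)


# Toy expression profile for a single cell (made-up values).
cell = {"CD3E": 5.0, "CD19": 0.0, "MS4A1": 0.0, "IL7R": 2.0}

vocab = build_gene_vocab(cell.keys())
tokens = cell_to_tokens(cell, vocab)
print(tokens)
```

In this framing, a cell plays the role of a sentence and each expressed gene plays the role of a word paired with a numeric value, which is the kind of sequence a transformer can be pretrained on.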