scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis

Tianyu Liu, Tianqi Chen, Wangjie Zheng, Xiao Luo, Hongyu Zhao
DOI: https://doi.org/10.1101/2023.12.07.569910
2024-03-03
Abstract: Various Foundation Models (FMs) have been built based on the pre-training and fine-tuning framework to analyze single-cell data with different degrees of success. In this manuscript, we propose a method named scELMo (Single-cell Embedding from Language Models) to analyze single-cell data, which utilizes Large Language Models (LLMs) as a generator for both the description of metadata information and the embeddings for such descriptions. We combine the embeddings from LLMs with the raw data under the zero-shot learning framework, and further extend its function by using the fine-tuning framework to handle different tasks. We demonstrate that scELMo is capable of cell clustering, batch effect correction, and cell-type annotation without training a new model. Moreover, the fine-tuning framework of scELMo can help with more challenging tasks including in-silico treatment analysis or modeling perturbation. scELMo has a lighter structure and lower requirement for resources. Moreover, it is comparable to recent large-scale FMs (e.g., scGPT, Geneformer) based on our evaluations, suggesting a promising path for developing domain-specific FMs.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how to use large language models (LLMs) to analyze single-cell data and proposes a method called scELMo. scELMo uses LLMs to generate text descriptions of cell or feature metadata, along with embeddings of those descriptions, and then combines these embeddings with the raw data in a zero-shot learning framework. In this zero-shot setting it supports cell clustering, batch effect correction, and cell-type annotation without training a new model; an additional fine-tuning framework extends it to harder tasks such as in-silico treatment analysis and perturbation modeling. Compared with existing methods, scELMo has a lighter structure and lower resource requirements, with performance comparable to large-scale foundation models such as scGPT and Geneformer. The paper demonstrates the effectiveness of scELMo across these tasks and evaluates the LLMs used to check the accuracy of the generated information.
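The text above does not spell out how the LLM embeddings are combined with the raw data. One simple possibility, assumed here purely for illustration, is to take each cell's expression-weighted average of per-gene description embeddings. The function name `cell_embeddings`, the toy matrices, and the weighting scheme are all hypothetical and are not taken from the scELMo code.

```python
import numpy as np

def cell_embeddings(expr, gene_emb):
    """Expression-weighted average of gene embeddings for each cell.

    expr:     (n_cells, n_genes) non-negative expression matrix (assumed input).
    gene_emb: (n_genes, dim) one embedding per gene description (assumed input).
    """
    expr = np.asarray(expr, dtype=float)
    gene_emb = np.asarray(gene_emb, dtype=float)
    totals = expr.sum(axis=1, keepdims=True)
    # Normalize each cell's expression to weights summing to 1;
    # cells with no counts keep all-zero weights instead of dividing by zero.
    weights = np.divide(expr, totals, out=np.zeros_like(expr), where=totals > 0)
    return weights @ gene_emb

# Toy data: 3 cells x 3 genes, and a 2-dimensional embedding per gene.
expr = np.array([[2, 0, 0],
                 [1, 1, 0],
                 [0, 0, 0]])
gene_emb = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

cell_emb = cell_embeddings(expr, gene_emb)
print(cell_emb)
```

The resulting cell-by-dimension matrix could then be passed to a standard clustering or nearest-neighbor annotation step without training a new model, which is the sense in which such a pipeline would be zero-shot.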