How do Large Language Models understand Genes and Cells

Chen Fang, Yidong Wang, Yunze Song, Qingqing Long, Wang Lu, Linghui Chen, Pengfei Wang, Guihai Feng, Yuanchun Zhou, Xin Li
DOI: https://doi.org/10.1101/2024.03.23.586383
2024-03-27
Abstract: Researching genes and their interactions is crucial for deciphering the fundamental laws of biological activity, advancing disease treatment, drug discovery and so on. Large Language Models (LLMs), with their profound text comprehension and generation capabilities, have made significant strides across various natural science fields. However, their application in cell biology remains notably scarce. To alleviate this issue, in this paper, we select seven mainstream LLMs and evaluate their performance across a range of problem scenarios. Our findings indicate that LLMs possess a certain level of understanding of genes and cells, and hold potential for solving real-world problems. Moreover, we have improved the current method of textual representation of cells, enhancing the LLMs’ ability to tackle cell annotation tasks. We encourage cell biology researchers to leverage LLMs for problem-solving while also being mindful of some challenges associated with their use. We release our code and data at .
Bioinformatics
What problem does this paper attempt to address?
The paper attempts to address the following key issues:

1. **Insufficient understanding of genes and cells**: Although large language models (LLMs) have made significant progress in natural language processing (NLP) and shown potential in various natural science domains, their application in cell biology remains relatively limited. The paper aims to evaluate and improve how well LLMs understand genes and cells.
2. **Gene recognition and function prediction**: Researchers hope that LLMs can effectively identify the functions and characteristics of individual genes and predict interactions between genes. This includes, but is not limited to, gene dosage sensitivity, bivalent chromatin structure, genomic distance of transcription factors, and prediction of core regulatory elements.
3. **Cell annotation tasks**: Cells can be seen as collections of genes, with their functions and morphology determined by the selective expression of specific genes. The paper explores how to use LLMs for cell type annotation, which is a critical step in single-cell data analysis and essential for understanding the functions and characteristics of different cells.
4. **Improvement of text representation methods**: Existing cell text representation methods (such as "cell sentences") lack the structure of natural language, which limits how well LLMs can understand them. The paper proposes an improved method, "cell sentences plus," which adds a brief functional description after each gene name to improve LLM performance on cell annotation tasks.

Through these studies, the paper hopes to provide cell biology researchers with new tools and methods for solving practical problems with LLMs, while also highlighting some challenges that need to be addressed when using them.
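To make the idea of "cell sentences plus" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a cell sentence is a list of gene names ranked by expression level in descending order, and that "cell sentences plus" appends a short functional description after each gene name. The function names `cell_sentence` and `cell_sentence_plus`, the `top_k` parameter, the toy expression values, and the hand-written descriptions are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def cell_sentence(expression: Dict[str, float], top_k: int = 3) -> List[str]:
    """Return the names of the top_k expressed genes, highest expression first.

    Assumption: a "cell sentence" is a rank-ordered list of gene names.
    Genes with zero expression are dropped; ties are broken alphabetically.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    ranked = sorted(expressed, key=lambda gv: (-gv[1], gv[0]))
    return [gene for gene, _ in ranked[:top_k]]


def cell_sentence_plus(genes: List[str], descriptions: Dict[str, str]) -> str:
    """Join gene names into one string, adding a description after each gene.

    Assumption: the description follows the gene name in parentheses.
    Genes without a known description are emitted as the bare name.
    """
    parts = []
    for gene in genes:
        description = descriptions.get(gene)
        parts.append(f"{gene} ({description})" if description else gene)
    return ", ".join(parts)


# Toy, made-up expression values for a single cell (illustration only).
toy_expression = {"CD3E": 5.0, "CD19": 0.0, "MS4A1": 1.0, "CD3D": 4.0}

# Hand-written short descriptions (illustration only, not from the paper's data).
toy_descriptions = {
    "CD3E": "CD3 epsilon subunit of the T-cell receptor complex",
    "CD3D": "CD3 delta subunit of the T-cell receptor complex",
    "MS4A1": "encodes CD20, a B-cell surface protein",
}

genes = cell_sentence(toy_expression, top_k=3)
prompt_text = cell_sentence_plus(genes, toy_descriptions)
```

In this sketch, `prompt_text` is the string that would be placed in an LLM prompt in place of the bare gene list. How descriptions are sourced and how the prompt is phrased are design choices the summary above does not specify.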