How do Large Language Models understand Genes and Cells

Chen Fang, Yidong Wang, Yunze Song, Qingqing Long, Wang Lu, Linghui Chen, Pengfei Wang, Guihai Feng, Yuanchun Zhou, Xin Li

DOI: https://doi.org/10.1101/2024.03.23.586383

2024-03-27

Abstract: Researching genes and their interactions is crucial for deciphering the fundamental laws of biological activity and for advancing disease treatment and drug discovery. Large Language Models (LLMs), with their strong text comprehension and generation capabilities, have made significant strides across various natural science fields. However, their application in cell biology remains notably scarce. To address this gap, we select seven mainstream LLMs and evaluate their performance across a range of problem scenarios. Our findings indicate that LLMs possess a certain level of understanding of genes and cells, and hold potential for solving real-world problems. Moreover, we improve the current method of textual representation of cells, enhancing the LLMs’ ability to tackle cell annotation tasks. We encourage cell biology researchers to leverage LLMs for problem-solving while remaining mindful of the challenges associated with their use. We release our code and data at .

Bioinformatics

What problem does this paper attempt to address?

The paper attempts to address the following key issues:

1. **Insufficient understanding of genes and cells**: Although large language models (LLMs) have made significant progress in the field of natural language processing (NLP) and shown potential in various natural science domains, their application in cell biology remains relatively limited. The paper aims to evaluate and improve the understanding capabilities of LLMs regarding genes and cells.
2. **Gene recognition and function prediction**: Researchers hope that LLMs can effectively identify the functions and characteristics of individual genes and predict interactions between genes. This includes, but is not limited to, gene dosage sensitivity, bivalent chromatin structure, genomic distance of transcription factors, and prediction of core regulatory elements.
3. **Cell annotation tasks**: Cells can be seen as collections of genes, with their functions and morphology determined by the selective expression of specific genes. The paper explores how to utilize LLMs for cell type annotation, which is a critical step in single-cell data analysis and essential for understanding the functions and characteristics of different cells.
4. **Improvement of text representation methods**: Existing cell text representation methods (such as "cell sentences") lack the structure of natural language, limiting the understanding capabilities of LLMs. The paper proposes an improved method, "cell sentences plus," which adds a brief functional description after each gene name to enhance the performance of LLMs in cell annotation tasks.

Through these studies, the paper hopes to provide cell biology researchers with new tools and methods to solve practical problems using LLMs, while also highlighting some challenges that need to be addressed when using LLMs.
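
The difference between a "cell sentence" and a "cell sentences plus" representation can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that a cell sentence is a list of gene names ordered by descending expression, and that the "plus" variant appends a short description in parentheses after each name. The function names, the `top_k` cutoff, the output format, and the toy expression values and descriptions are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def cell_sentence(expression: Dict[str, float], top_k: int = 3) -> List[str]:
    """Return up to top_k expressed gene names, highest expression first.

    Assumption: genes with zero expression are dropped and ties are
    broken alphabetically so the output is deterministic.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    ranked = sorted(expressed, key=lambda gv: (-gv[1], gv[0]))
    return [gene for gene, _ in ranked[:top_k]]


def cell_sentence_plus(genes: List[str], descriptions: Dict[str, str]) -> str:
    """Append a brief functional description after each gene name.

    Genes without a known description are emitted as the bare name.
    The "GENE (description)" format is an assumption, not the paper's.
    """
    parts = []
    for gene in genes:
        description = descriptions.get(gene)
        parts.append(f"{gene} ({description})" if description else gene)
    return ", ".join(parts)


# Toy expression values for a single cell (illustrative, not real data).
expression = {"CD3E": 5.0, "LYZ": 0.0, "MS4A1": 1.5, "CD3D": 4.2, "ACTB": 9.1}

# Hypothetical short descriptions; a real pipeline would need a curated source.
descriptions = {
    "ACTB": "cytoskeletal beta-actin",
    "CD3E": "T-cell receptor complex subunit",
}

genes = cell_sentence(expression)
plain = " ".join(genes)
plus = cell_sentence_plus(genes, descriptions)

print(plain)
print(plus)
```

The plain form is a bare sequence of gene symbols, while the "plus" form reads closer to natural language, which is the property the summary credits for the improved annotation performance.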

An Evaluation of Large Language Models in Bioinformatics Research

Genomic Language Models: Opportunities and Challenges

ChatCell: Facilitating Single-Cell Analysis with Natural Language

scReader: Prompting Large Language Models to Interpret scRNA-seq Data

Large language models in bioinformatics: applications and perspectives

Geneverse: A collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

Large Language Models for Biomolecular Analysis: from Methods to Applications

Scientific Large Language Models: A Survey on Biological & Chemical Domains

A Comprehensive Evaluation of Large Language Models in Mining Gene Interactions and Pathway Knowledge

A comprehensive evaluation of large language models in mining gene relations and pathway knowledge

Large Language Models in Plant Biology

A Survey for Large Language Models in Biomedicine

Integrating Large Language Models in Bioinformatics Education for Medical Students: Opportunities and Challenges

Large language models reshaping molecular biology and drug development

GeneSUM: Large Language Model-based Gene Summary Extraction

CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

L2G: Repurposing Language Models for Genomics Tasks

Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data

Survey on large language model annotation of cellular senescence from figures in review articles

The Development of AI Foundation Models for Single-Cell Transcriptomics