GeneRAG: Enhancing Large Language Models with Gene-Related Task by Retrieval-Augmented Generation

Xinyi Lin, Gelei Deng, Yuekang Li, Jingquan Ge, Joshua Wing Kei Ho, Yi Liu
DOI: https://doi.org/10.1101/2024.06.24.600176
2024-06-28
Abstract: Large Language Models (LLMs) like GPT-4 have revolutionized natural language processing and are used in gene analysis, but their gene knowledge is incomplete. Fine-tuning LLMs with external data is costly and resource-intensive. Retrieval-Augmented Generation (RAG) integrates relevant external information dynamically. We introduce GeneRAG, a framework that enhances LLMs' gene-related capabilities using RAG and the Maximal Marginal Relevance (MMR) algorithm. Evaluations with datasets from the National Center for Biotechnology Information (NCBI) show that GeneRAG outperforms GPT-3.5 and GPT-4, with a 39% improvement in answering gene questions, a 43% performance increase in cell type annotation, and a 0.25 decrease in error rates for gene interaction prediction. These results highlight GeneRAG's potential to bridge a critical gap in LLM capabilities for more effective applications in genetics.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the incomplete gene knowledge of large language models (LLMs). Although models such as GPT-4 have transformed natural language processing and are already used in gene analysis, their gene knowledge has gaps, and fine-tuning them on external data is costly and resource-intensive. The paper proposes GeneRAG, a framework that uses Retrieval-Augmented Generation (RAG) together with the Maximal Marginal Relevance (MMR) algorithm to integrate relevant external information dynamically. In evaluations on datasets from the National Center for Biotechnology Information (NCBI), GeneRAG outperforms GPT-3.5 and GPT-4, with a 39% improvement in answering gene questions, a 43% performance increase in cell type annotation, and a 0.25 decrease in error rate for gene interaction prediction. These results suggest that GeneRAG can help close a critical gap in applying current LLMs to genetics.
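The summary names MMR but does not describe how GeneRAG applies it. As a rough illustration of the general idea only, here is a minimal sketch of the standard greedy MMR selection step. It assumes passages and the query are already embedded as vectors and compared by cosine similarity. The function names (`cosine`, `mmr_select`), the trade-off weight `lam`, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    """Greedy MMR: return indices of up to k documents.

    Each step picks the candidate maximizing
        lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    lam = 1.0 ranks by relevance alone; smaller lam penalizes redundancy more.
    The default lam = 0.5 is an arbitrary illustrative choice.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            # Redundancy is 0.0 while nothing has been selected yet.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy embeddings (made up): docs 0 and 1 are near-duplicates, doc 2 is distinct.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

print(mmr_select(query, docs, k=2, lam=1.0))  # [0, 1]: relevance only
print(mmr_select(query, docs, k=2, lam=0.3))  # [0, 2]: near-duplicate skipped
```

In a retrieval-augmented setup, the selected passages would then be placed in the prompt sent to the LLM. With a lower `lam`, the context favors passages that cover different facts over several copies of the same one.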