Abstract: Question answering for life science research, a field characterized by a rapid pace of discovery, evolving insights, and complex interactions among knowledge entities, presents unique challenges in maintaining a comprehensive knowledge warehouse and retrieving information accurately. To address these issues, we introduce BioRAG, a novel Retrieval-Augmented Generation (RAG) framework built on Large Language Models (LLMs). Our approach starts with parsing, indexing, and segmenting an extensive collection of 22 million scientific papers as the basic knowledge, followed by training a specialized embedding model tailored to this domain. Additionally, we enhance the vector retrieval process by incorporating a domain-specific knowledge hierarchy, which aids in modeling the intricate interrelationships between each query and its context. For queries requiring the most current information, BioRAG deconstructs the question and employs an iterative retrieval process integrated with a search engine for step-by-step reasoning. Rigorous experiments demonstrate that our model outperforms fine-tuned LLMs, LLMs with search engines, and other scientific RAG frameworks across multiple life science question-answering tasks.

What problem does this paper attempt to address?

The paper addresses the challenges that automatic question-answering systems face in biological research, a field where knowledge evolves rapidly. It focuses on how to build a system that can handle complex knowledge structures, keep its knowledge up to date, and retrieve information efficiently. To tackle these issues, the authors propose the BioRAG (Biological Retrieval-Augmented Generation) framework, which combines large language models (LLMs) with retrieval-augmented generation (RAG) to build an efficient question reasoning system for the biological domain. Specifically, BioRAG addresses the problem through the following steps:

1. **Constructing a high-quality knowledge base**: A large amount of biomedical literature was obtained from sources such as the National Center for Biotechnology Information (NCBI) and preprocessed to ensure data quality. After screening and processing, these documents formed a high-quality corpus containing over 22 million literature abstracts.
2. **Developing a specialized embedding model**: A specialized embedding model for the biological field was built on PubMedBERT and fine-tuned using the CLIP technique to improve the model's understanding and retrieval of biological questions.
3. **Integrating external information sources**: To keep the system timely and accurate, the paper also introduces various external databases and search engines as supplementary information sources, including gene databases, single nucleotide polymorphism (SNP) databases, protein databases, and general search engines such as Google and Bing.
4. **Self-assessment mechanism**: BioRAG integrates a self-assessment mechanism that evaluates whether the information collected so far is sufficient to answer the posed question. If the internal information is insufficient, it uses external tools for extended searches.
5. **Customized prompts**: To make better use of the retrieved information, a series of customized prompts was designed to guide the model in selecting appropriate retrieval methods, rewriting queries, performing retrieval, and ultimately generating answers to the questions.

Experiments on multiple biology-related question-answering datasets validated the effectiveness of BioRAG. The paper demonstrates that BioRAG outperforms the baseline methods on domain-specific problems in the biomedical field, especially those requiring highly specialized knowledge. In summary, the main goal of this paper is to develop an automatic question-answering system that can cope with the rapid change and complexity of biological research, thereby promoting interdisciplinary collaboration and the effective integration of biological knowledge.
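
The retrieval and self-assessment steps in this summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for the trained embedding model, the similarity threshold in `is_sufficient` is a hypothetical stand-in for the LLM-based self-assessment, and `external_search` is an assumed callback representing the external databases and search engines.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a trained embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return the k documents most similar to the query, with their scores."""
    q = embed(query)
    scored = [(doc, cosine(q, embed(doc))) for doc in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def is_sufficient(hits, threshold=0.3):
    """Hypothetical self-assessment: the best hit must reach a similarity threshold."""
    return bool(hits) and hits[0][1] >= threshold


def gather_context(query, local_corpus, external_search, max_rounds=2):
    """Retrieve locally first; fall back to external sources if the check fails."""
    hits = retrieve(query, local_corpus)
    source = "local"
    rounds = 0
    while not is_sufficient(hits) and rounds < max_rounds:
        extra = external_search(query)  # assumed callback, e.g. a search engine wrapper
        hits = retrieve(query, local_corpus + extra)
        source = "local+external"
        rounds += 1
    return source, [doc for doc, _ in hits]
```

A caller would pass its own corpus and search function, for example `gather_context("TP53 mutation cancer", abstracts, my_search)`, and hand the returned documents to the language model as context. In a real system the threshold check would likely be replaced by a prompt asking the model whether the retrieved context answers the question.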

BioRAG: A RAG-LLM Framework for Biological Question Reasoning

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

Rationale-Guided Retrieval Augmented Generation for Medical Question Answering

SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models

BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine

DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering

LightRAG: Simple and Fast Retrieval-Augmented Generation

REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering

A Multi-Source Retrieval Question Answering Framework Based on RAG

ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents

QA-RAG: Exploring LLM Reliance on External Knowledge

WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs

StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question Answering

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning

Improving Retrieval for RAG based Question Answering Models on Financial Documents