Abstract: We introduce a Retrieval-Augmented Generation (RAG) system for translating user questions into accurate federated SPARQL queries over bioinformatics knowledge graphs (KGs) leveraging Large Language Models (LLMs). To enhance accuracy and reduce hallucinations in query generation, our system utilises metadata from the KGs, including query examples and schema information, and incorporates a validation step to correct generated queries. The system is available online at http://chat.expasy.org.

What problem does this paper attempt to address?

This paper addresses the problem of converting natural language questions into SPARQL queries over federated knowledge graphs in bioinformatics, efficiently and accurately. Writing complex SPARQL queries by hand, especially federated queries that span multiple connected knowledge graphs, is time-consuming and challenging even for experts. The paper therefore proposes a system based on Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) that automatically translates user questions into accurate SPARQL queries.

### Main Objectives:

1. **Improve Query Accuracy**: Reduce errors and hallucinations in generated queries by leveraging metadata from the knowledge graphs (such as query examples and schema information).
2. **Dynamically Adapt to Changing Datasets**: Integrate continuously changing datasets without the need for frequent model retraining.
3. **Provide a User-Friendly Interface**: Develop an online system where users can enter natural language questions and obtain the corresponding SPARQL query results.

### Key Points of the Solution:

- **RAG System**: Combines retrieval and generation: relevant context is retrieved from knowledge graph metadata and used to generate more accurate queries.
- **Validation and Correction Mechanism**: Checks the correctness of generated SPARQL queries by parsing them and verifying that they conform to the knowledge graph schema.
- **Modular Design**: Each component of the system can be used independently, different LLMs are supported, and the code is open source to facilitate community contributions and improvements.

### Evaluation and Discussion:

- **Preliminary Testing**: A test suite of 13 questions, each with a reference query, was designed to evaluate the system's performance.
- **Comparison of Multiple Configurations**: The system was tested under three configurations: 1) using only an LLM; 2) using RAG without validation; 3) using RAG with validation and correction. Larger LLMs generally performed better, while query validation was particularly important for smaller LLMs, not only improving accuracy but also ensuring that the queries returned at least some relevant results.
- **Future Work**: The authors plan comprehensive evaluations using standardized benchmarks and evaluation frameworks to further improve the system's accuracy, usability, and overall performance.

Through these methods, the paper aims to significantly reduce the time and expertise required to query federated knowledge graphs, thereby making bioinformatics data easier to use.
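
The retrieval and validation steps summarized above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the example store, the schema, the token-overlap similarity (a stand-in for whatever retrieval method the real system uses), and all function names (`retrieve`, `build_prompt`, `validate`, `correction_hint`) are assumptions made for illustration, and the LLM call itself is left out.

```python
import re

# Hypothetical example store: (question, SPARQL query) pairs standing in for
# the query examples the system retrieves from knowledge graph metadata.
EXAMPLES = [
    ("Which human proteins are enzymes?",
     "SELECT ?protein WHERE { ?protein a up:Protein ; "
     "up:organism taxon:9606 ; up:enzyme ?enzyme . }"),
    ("Which genes are expressed in the liver?",
     'SELECT ?gene WHERE { ?gene genex:isExpressedIn ?anat . '
     '?anat rdfs:label "liver" . }'),
]

# Hypothetical schema: prefix -> local names known to exist for that prefix.
SCHEMA = {
    "up": {"Protein", "organism", "enzyme"},
    "genex": {"isExpressedIn"},
}


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, examples=EXAMPLES, k=1):
    """Return the k examples whose questions best match (Jaccard overlap)."""
    q = tokens(question)

    def score(example):
        e = tokens(example[0])
        return len(q & e) / len(q | e) if q | e else 0.0

    return sorted(examples, key=score, reverse=True)[:k]


def build_prompt(question, examples):
    """Assemble a few-shot prompt from the retrieved examples."""
    shots = "\n\n".join(f"Question: {q}\nQuery: {s}" for q, s in examples)
    return f"{shots}\n\nQuestion: {question}\nQuery:"


def validate(query, schema=SCHEMA):
    """List prefixed terms that use a schema prefix but are not in the schema.

    Terms with prefixes the schema does not cover are ignored.
    """
    found = re.findall(r"\b([A-Za-z][\w-]*):([A-Za-z_][\w-]*)", query)
    return sorted({f"{p}:{name}" for p, name in found
                   if p in schema and name not in schema[p]})


def correction_hint(unknown_terms):
    """Feedback message that could be sent back to the LLM for a retry."""
    if not unknown_terms:
        return ""
    return ("The query uses terms missing from the schema: "
            + ", ".join(unknown_terms) + ". Please rewrite it.")
```

In this sketch, a generated query that invents a predicate such as `up:hasEnzyme` would be flagged by `validate`, and the text from `correction_hint` could be appended to a follow-up prompt. A real validator would parse the query with a SPARQL parser rather than a regular expression.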

LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs
