LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs

Vincent Emonet, Jerven Bolleman, Severine Duvaud, Tarcisio Mendes de Farias, Ana Claudia Sima
2024-10-21
Abstract: We introduce a Retrieval-Augmented Generation (RAG) system for translating user questions into accurate federated SPARQL queries over bioinformatics knowledge graphs (KGs) leveraging Large Language Models (LLMs). To enhance accuracy and reduce hallucinations in query generation, our system utilises metadata from the KGs, including query examples and schema information, and incorporates a validation step to correct generated queries. The system is available online at http://chat.expasy.org.
Databases, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
This paper addresses the problem of converting natural language questions efficiently and accurately into SPARQL queries over federated knowledge graphs in bioinformatics. Writing complex SPARQL queries by hand, especially federated queries that span multiple connected knowledge graphs, is time-consuming and challenging even for experts. The paper therefore proposes a method based on Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) that automatically transforms user questions into accurate SPARQL queries.

### Main Objectives:

1. **Improve Query Accuracy**: Reduce errors and hallucinations when generating queries by leveraging metadata from knowledge graphs (such as query examples and schema information).
2. **Dynamically Adapt to Changing Datasets**: Integrate continuously changing datasets dynamically without the need for frequent model retraining.
3. **Provide a User-Friendly Interface**: Develop an online system that allows users to easily input natural language questions and obtain corresponding SPARQL query results.

### Key Points of the Solution:

- **RAG System**: Combines retrieval and generation techniques to retrieve relevant context from knowledge graphs and use this context to generate more accurate queries.
- **Validation and Correction Mechanism**: Checks the generated SPARQL queries by parsing them and verifying that they conform to the knowledge graph schema, then corrects queries that fail the check.
- **Modular Design**: Each component of the system can be used independently, different LLMs are supported, and the code is open source to facilitate community contributions and improvements.

### Evaluation and Discussion:

- **Preliminary Testing**: The authors designed a test suite of 13 questions, each with a reference query, to evaluate the system's performance.
- **Comparison of Multiple Configurations**: The system was tested under three configurations: 1) using only an LLM; 2) using RAG without validation; 3) using RAG with validation and correction. Larger LLMs generally performed better, while query validation was particularly important for smaller LLMs, not only improving accuracy but also ensuring that the queries returned at least some relevant results.
- **Future Work**: The authors plan comprehensive evaluations using standardized benchmarks and evaluation frameworks to further improve the system's accuracy, usability, and overall performance.

Through these methods, the paper aims to significantly reduce the time and expertise required to query federated knowledge graphs, thereby making data in bioinformatics easier to use.
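The validation and correction mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation. The toy `SCHEMA`, the `validate_query` and `generate_with_validation` helpers, and the `fake_llm` stand-in are all hypothetical, and a naive regular expression replaces the real SPARQL parsing that the paper describes. The idea shown is only the loop itself: generate a query, check its terms against schema metadata, and feed any issues back to the generator.

```python
import re

# Hypothetical toy schema: classes mapped to the predicates assumed valid for them.
# In the system described above, this information would come from KG metadata.
SCHEMA = {
    "up:Protein": {"up:organism", "up:mnemonic", "up:annotation"},
    "up:Taxon": {"up:scientificName"},
}

# Naive matcher for prefixed names such as up:Protein (not a full SPARQL parser).
PREFIXED_NAME = re.compile(r"\b[A-Za-z][\w-]*:[A-Za-z_][\w-]*")


def known_terms(schema):
    """Collect every class and predicate mentioned in the schema."""
    terms = set(schema)
    for predicates in schema.values():
        terms |= predicates
    return terms


def validate_query(query, schema):
    """Return a list of issues found in the query; an empty list means none found."""
    issues = []
    if not re.search(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", query, re.IGNORECASE):
        issues.append("Missing query form (SELECT, ASK, CONSTRUCT or DESCRIBE)")
    if query.count("{") != query.count("}"):
        issues.append("Unbalanced braces")
    # Drop full IRIs in angle brackets so their colons are not mistaken for prefixes.
    body = re.sub(r"<[^>]*>", " ", query)
    known = known_terms(schema)
    for term in sorted(set(PREFIXED_NAME.findall(body))):
        if term not in known:
            issues.append(f"Unknown term: {term}")
    return issues


def generate_with_validation(question, generate, schema, max_attempts=3):
    """Call the generator repeatedly, passing validation issues back as feedback."""
    query, feedback = "", []
    for _ in range(max_attempts):
        query = generate(question, feedback)
        feedback = validate_query(query, schema)
        if not feedback:
            break
    return query, feedback


def fake_llm(question, feedback):
    """Stand-in for an LLM call: hallucinates a predicate until it receives feedback."""
    predicate = "up:organism" if feedback else "up:species"
    return (
        "PREFIX up: <http://purl.uniprot.org/core/>\n"
        "SELECT ?protein WHERE { ?protein a up:Protein ; " + predicate + " ?taxon . }"
    )


if __name__ == "__main__":
    final_query, remaining = generate_with_validation(
        "Which proteins belong to which organism?", fake_llm, SCHEMA
    )
    print(final_query)
    print("Remaining issues:", remaining)
```

In this sketch the first attempt uses a predicate that is absent from the toy schema, so the validator reports it and the second attempt is accepted. A real generator would receive the issue list as part of its prompt, and a real validator would also need to handle string literals, property paths, and terms inside `SERVICE` clauses that target other endpoints.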