LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration

Yukun Cao, Zengyi Gao, Zhiyang Li, Xike Xie, S Kevin Zhou
2024-11-06
Abstract: GraphRAG addresses significant challenges in Retrieval-Augmented Generation (RAG) by leveraging graphs with embedded knowledge to enhance the reasoning capabilities of Large Language Models (LLMs). Despite its promising potential, the GraphRAG community currently lacks a unified framework for fine-grained decomposition of the graph-based knowledge retrieval process. Furthermore, there is no systematic categorization or evaluation of existing solutions within the retrieval process. In this paper, we present LEGO-GraphRAG, a modular framework that decomposes the retrieval process of GraphRAG into three interconnected modules: subgraph-extraction, path-filtering, and path-refinement. We systematically summarize and classify the algorithms and neural network (NN) models relevant to each module, providing a clearer understanding of the design space for GraphRAG instances. Additionally, we identify key design factors, such as Graph Coupling and Computational Cost, that influence the effectiveness of GraphRAG implementations. Through extensive empirical studies, we construct high-quality GraphRAG instances using a representative selection of solutions and analyze their impact on retrieval and reasoning performance. Our findings offer critical insights into optimizing GraphRAG instance design, ultimately contributing to the advancement of more accurate and contextually relevant LLM applications.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses two key challenges in **GraphRAG** (Graph-based Retrieval-Augmented Generation):

1. **Lack of a unified framework**: The GraphRAG community has no unified framework for systematically classifying and evaluating existing solutions (i.e., algorithms and neural network models). As a result, existing GraphRAG work is hard to summarize and compare, and the actual effectiveness of a specific solution within the GraphRAG process is hard to isolate.
2. **Insufficient modularity**: Current research often treats GraphRAG as a single monolithic process, which obscures how much each potential module contributes to overall performance. A finer-grained modular framework makes it possible to analyze the trade-offs between module performance and solution selection, and to guide the design of GraphRAG instances that meet the requirements of specific scenarios.

### Solutions

To address these problems, the paper proposes a unified and modular research framework named **LEGO-GraphRAG**, organized around three key aspects:

1. **Modularization of GraphRAG**: LEGO-GraphRAG decomposes the process of retrieving "inference paths" into three interconnected and flexible modules: **Subgraph-Extraction**, **Path-Filtering**, and **Path-Refinement**.
2. **Solutions for GraphRAG**: LEGO-GraphRAG systematically summarizes and classifies the algorithms and neural network models available for each module, giving a clear picture of the design space of GraphRAG instances.
3. **Design factors of GraphRAG**: LEGO-GraphRAG identifies two main factors that affect the design of GraphRAG instances, namely **Graph Coupling** and **Computational Cost**, and analyzes how these factors constrain the solutions available for each module.
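The three-module decomposition described above can be illustrated with a minimal Python sketch. Only the module names (subgraph-extraction, path-filtering, path-refinement) come from the paper; everything else is an assumption made for illustration. The function names, the toy triples, and the specific strategies (k-hop expansion, simple-path enumeration, token-overlap ranking) are hypothetical stand-ins, not the solutions the authors evaluate.

```python
# Toy knowledge graph as (head, relation, tail) triples; illustrative data only.
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
    ("Paris", "has_landmark", "Eiffel Tower"),
    ("Berlin", "capital_of", "Germany"),
    ("Germany", "located_in", "Europe"),
]


def extract_subgraph(triples, seeds, hops):
    """Subgraph-extraction. Assumed strategy: k-hop expansion along outgoing edges."""
    frontier, seen, kept = set(seeds), set(seeds), []
    for _ in range(hops):
        next_frontier = set()
        for h, r, t in triples:
            if h in frontier:
                kept.append((h, r, t))
                if t not in seen:
                    seen.add(t)
                    next_frontier.add(t)
        frontier = next_frontier
    return kept


def filter_paths(subgraph, seeds, max_len):
    """Path-filtering. Assumed strategy: enumerate simple paths of up to max_len edges."""
    paths = []
    stack = [(seed, []) for seed in seeds]
    while stack:
        node, path = stack.pop()
        if path:
            paths.append(path)
        if len(path) == max_len:
            continue
        # Nodes already on this path: the current node plus every earlier head.
        visited = {node} | {h for h, _, _ in path}
        for h, r, t in subgraph:
            if h == node and t not in visited:
                stack.append((t, path + [(h, r, t)]))
    return paths


def refine_paths(paths, query, top_k):
    """Path-refinement. Assumed strategy: rank by token overlap with the query."""
    query_tokens = set(query.lower().replace("?", "").split())

    def score(path):
        tokens = set()
        for h, r, t in path:
            tokens.update(h.lower().split())
            tokens.update(r.lower().split("_"))
            tokens.update(t.lower().split())
        return len(query_tokens & tokens)

    # Higher overlap first; ties are broken in favour of shorter paths.
    return sorted(paths, key=lambda p: (-score(p), len(p)))[:top_k]


def retrieve(triples, query, seeds, hops=2, top_k=1):
    """Chain the three modules; each stage can be swapped independently."""
    subgraph = extract_subgraph(triples, seeds, hops)
    candidates = filter_paths(subgraph, seeds, max_len=hops)
    return refine_paths(candidates, query, top_k)
```

Calling `retrieve(TRIPLES, "Which country is Paris the capital of?", ["Paris"])` on this toy graph returns the single path `[("Paris", "capital_of", "France")]`. Because each stage only consumes the previous stage's output, any one function can be replaced (for example, by a learned ranker in the refinement step) without touching the others, which is the kind of design-space exploration the framework is meant to support.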
### Experiments and analysis

Using the LEGO-GraphRAG framework, the authors construct a set of high-quality GraphRAG instances by combining the most representative algorithms and neural network models from each solution type, while ensuring that every solution type is covered for each module. Through extensive empirical studies, they analyze both the overall retrieval performance and the LLM-based reasoning performance of these instances, and synthesize the results into several key insights on building GraphRAG instances from multiple analytical perspectives.

### Conclusions

By proposing the LEGO-GraphRAG framework, the paper fills the gap in systematic classification and evaluation in the GraphRAG field. It also provides a theoretical basis and practical guidance for designing more efficient and accurate GraphRAG instances, ultimately supporting the use of large language models in a wider range of application scenarios.