CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models

Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, Enhong Chen
2024-07-15
Abstract: Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources. This method addresses common LLM limitations, including outdated information and the tendency to produce inaccurate "hallucinated" content. However, the evaluation of RAG systems is challenging, as existing benchmarks are limited in scope and diversity. Most of the current benchmarks predominantly assess question-answering applications, overlooking the broader spectrum of situations where RAG could prove advantageous. Moreover, they only evaluate the performance of the LLM component of the RAG pipeline in the experiments, and neglect the influence of the retrieval component and the external knowledge database. To address these issues, this paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios. Specifically, we have categorized the range of RAG applications into four distinct types: Create, Read, Update, and Delete (CRUD), each representing a unique use case. "Create" refers to scenarios requiring the generation of original, varied content. "Read" involves responding to intricate questions in knowledge-intensive situations. "Update" focuses on revising and rectifying inaccuracies or inconsistencies in pre-existing texts. "Delete" pertains to the task of summarizing extensive texts into more concise forms. For each of these CRUD categories, we have developed comprehensive datasets to evaluate the performance of RAG systems. We also analyze the effects of various components of the RAG system, such as the retriever, the context length, the knowledge base construction, and the LLM. Finally, we provide useful insights for optimizing the RAG technology for different scenarios.
Computation and Language
### What problem does this paper attempt to solve?

This paper aims to construct CRUD-RAG, a comprehensive Chinese benchmark for evaluating the performance of Retrieval-Augmented Generation (RAG) systems of Large Language Models (LLMs). Specifically, the paper attempts to address the following issues:

1. **Limitations of current benchmarks**:
   - Most current benchmarks focus primarily on question-answering tasks, neglecting other potential application scenarios.
   - Existing benchmarks often only evaluate the LLM part of the RAG system or the retriever part in knowledge-intensive scenarios, ignoring the impact of external knowledge base construction and the retriever component in non-knowledge-intensive scenarios.
2. **Need for comprehensive evaluation of RAG systems**:
   - Construct a large-scale and comprehensive benchmark that can evaluate various components of RAG systems in multiple application scenarios, including retrievers, context length, knowledge base construction, and LLM performance.
3. **Classification and evaluation of different application scenarios**:
   - Classify RAG application scenarios into four categories: Create, Read, Update, and Delete, and develop different datasets for each category to evaluate the performance of RAG systems.

Specifically, CRUD-RAG includes the following four types of tasks, one per category:

- **Text Continuation** (Create): Improve the input text by adding external information to generate creative outputs such as poetry, stories, or code.
- **Question Answering** (Read): Use external knowledge retrieval to answer questions, addressing issues like question answering, dialogue, and reasoning.
- **Hallucination Modification** (Update): Use retrieved content to correct errors in the input text, such as spelling, grammar, or factual errors.
- **Multi-Document Summarization** (Delete): Simplify the input text by improving retrieval results to remove unnecessary details, performing text summarization or simplification.

Through the construction of these tasks and datasets, the paper provides a comprehensive evaluation of RAG systems and proposes optimization suggestions.
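
As a rough illustration of how the four task types and the component analysis fit together, the sketch below models the CRUD taxonomy as an enum and enumerates a grid of pipeline configurations to evaluate on each task. It is a minimal sketch, not the benchmark's actual code: the names `CrudCategory`, `RagConfig`, `build_grid`, and `evaluation_plan` are hypothetical, the retriever and model names and all numeric values are placeholders, and using `top_k` as a proxy for context length and `chunk_size` as a proxy for knowledge base construction is an assumption made here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class CrudCategory(Enum):
    """The four RAG application categories named in the abstract."""
    CREATE = "Create"
    READ = "Read"
    UPDATE = "Update"
    DELETE = "Delete"


# Task names come from the summary above; the category-to-task pairing
# follows the category descriptions in the abstract.
TASK_BY_CATEGORY = {
    CrudCategory.CREATE: "Text Continuation",
    CrudCategory.READ: "Question Answering",
    CrudCategory.UPDATE: "Hallucination Modification",
    CrudCategory.DELETE: "Multi-Document Summarization",
}


@dataclass(frozen=True)
class RagConfig:
    """One hypothetical pipeline configuration (all fields are assumptions)."""
    retriever: str   # which retriever to use
    top_k: int       # retrieved chunks per query (proxy for context length)
    chunk_size: int  # chunk size used when building the knowledge base
    llm: str         # which generator model to use


def build_grid(retrievers, top_ks, chunk_sizes, llms):
    """Return every combination of the given component settings."""
    return [
        RagConfig(r, k, c, m)
        for r, k, c, m in product(retrievers, top_ks, chunk_sizes, llms)
    ]


def evaluation_plan(configs):
    """Pair every task with every configuration to be evaluated."""
    return [
        (category, task, cfg)
        for category, task in TASK_BY_CATEGORY.items()
        for cfg in configs
    ]


if __name__ == "__main__":
    # Placeholder values, not the settings used in the paper.
    configs = build_grid(["bm25", "dense"], [2, 8], [128, 512], ["placeholder-llm"])
    plan = evaluation_plan(configs)
    print(len(plan))  # 4 tasks x 8 configurations = 32
```

Keeping the configuration separate from the task list makes it easy to vary one component at a time (for example, only the retriever) while holding the others fixed, which is the kind of per-component analysis the abstract describes.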