Abstract: Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by incorporating external knowledge sources. This method addresses common LLM limitations, including outdated information and the tendency to produce inaccurate "hallucinated" content. However, the evaluation of RAG systems is challenging, as existing benchmarks are limited in scope and diversity. Most of the current benchmarks predominantly assess question-answering applications, overlooking the broader spectrum of situations where RAG could prove advantageous. Moreover, they only evaluate the performance of the LLM component of the RAG pipeline in the experiments, and neglect the influence of the retrieval component and the external knowledge database. To address these issues, this paper constructs a large-scale and more comprehensive benchmark, and evaluates all the components of RAG systems in various RAG application scenarios. Specifically, we have categorized the range of RAG applications into four distinct types: Create, Read, Update, and Delete (CRUD), each representing a unique use case. "Create" refers to scenarios requiring the generation of original, varied content. "Read" involves responding to intricate questions in knowledge-intensive situations. "Update" focuses on revising and rectifying inaccuracies or inconsistencies in pre-existing texts. "Delete" pertains to the task of summarizing extensive texts into more concise forms. For each of these CRUD categories, we have developed comprehensive datasets to evaluate the performance of RAG systems. We also analyze the effects of various components of the RAG system, such as the retriever, the context length, the knowledge base construction, and the LLM. Finally, we provide useful insights for optimizing the RAG technology for different scenarios.

### What problem does this paper attempt to solve?

This paper aims to construct CRUD-RAG, a comprehensive Chinese benchmark for evaluating Retrieval-Augmented Generation (RAG) systems built on Large Language Models (LLMs). Specifically, it addresses the following issues:

1. **Limitations of current benchmarks**:
   - Most current benchmarks focus primarily on question-answering tasks, neglecting other potential application scenarios.
   - Existing benchmarks often evaluate only the LLM part of the RAG system, or only the retriever in knowledge-intensive scenarios, ignoring the impact of external knowledge base construction and of the retriever in non-knowledge-intensive scenarios.
2. **Need for comprehensive evaluation of RAG systems**:
   - Construct a large-scale, comprehensive benchmark that can evaluate the components of RAG systems across multiple application scenarios, including the retriever, context length, knowledge base construction, and the LLM.
3. **Classification and evaluation of different application scenarios**:
   - Classify RAG application scenarios into four categories (Create, Read, Update, and Delete) and develop a dataset for each category to evaluate the performance of RAG systems.

Specifically, CRUD-RAG includes the following four types of tasks, one per category:

- **Text Continuation** (Create): Extend the input text by adding external information to generate creative outputs such as poetry, stories, or code.
- **Question Answering** (Read): Use external knowledge retrieval to answer questions, covering question answering, dialogue, and reasoning.
- **Hallucination Modification** (Update): Use retrieved content to correct errors in the input text, such as spelling, grammar, or factual errors.
- **Multi-Document Summarization** (Delete): Simplify the input text, using retrieved results to remove unnecessary details, as in text summarization or simplification.

Through the construction of these tasks and datasets, the paper provides a comprehensive evaluation of RAG systems and proposes optimization suggestions.
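
The structure described above, four CRUD task types crossed with the RAG components under study, can be sketched as a small evaluation plan. This is only an illustration under stated assumptions, not the benchmark's actual code: the names `RagConfig`, `config_grid`, and `evaluation_plan` are hypothetical, `top_k` stands in for context length, `chunk_size` stands in for knowledge base construction, and the retriever and LLM names are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class CrudCategory(Enum):
    CREATE = "Create"
    READ = "Read"
    UPDATE = "Update"
    DELETE = "Delete"


# Task names come from the summary above; pairing each with a CRUD
# category follows the definitions given in the abstract.
TASKS = {
    CrudCategory.CREATE: "Text Continuation",
    CrudCategory.READ: "Question Answering",
    CrudCategory.UPDATE: "Hallucination Modification",
    CrudCategory.DELETE: "Multi-Document Summarization",
}


@dataclass(frozen=True)
class RagConfig:
    """One hypothetical setting of the components the paper analyzes."""
    retriever: str   # placeholder retriever name
    top_k: int       # number of retrieved chunks (proxy for context length)
    chunk_size: int  # chunking choice (proxy for knowledge base construction)
    llm: str         # placeholder LLM name


def config_grid(retrievers, top_ks, chunk_sizes, llms):
    """Enumerate every combination of component settings."""
    return [
        RagConfig(r, k, c, m)
        for r, k, c, m in product(retrievers, top_ks, chunk_sizes, llms)
    ]


def evaluation_plan(configs):
    """Pair every CRUD task with every component configuration."""
    return [
        (category, task, cfg)
        for category, task in TASKS.items()
        for cfg in configs
    ]


# Illustrative values only; none of these come from the paper.
configs = config_grid(["bm25", "dense"], [2, 4], [128], ["llm-a"])
plan = evaluation_plan(configs)
```

Each entry of `plan` is one (category, task, configuration) run, so varying a single field of `RagConfig` while holding the others fixed isolates the effect of that component on a given task type.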

CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models