Abstract: Retrieval-Augmented Generation (RAG) aims to enhance large language models (LLMs) to generate more accurate and reliable answers with the help of the retrieved context from external knowledge sources, thereby reducing the incidence of hallucinations. Despite the advancements, evaluating these systems remains a crucial research area due to the following issues: (1) Limited data diversity: The insufficient diversity of knowledge sources and query types constrains the applicability of RAG systems; (2) Obscure problems location: Existing evaluation methods have difficulty in locating the stage of the RAG pipeline where problems occur; (3) Unstable retrieval evaluation: These methods often fail to effectively assess retrieval performance, particularly when the chunking strategy changes. To tackle these challenges, we propose a Comprehensive Full-chain Evaluation (CoFE-RAG) framework to facilitate thorough evaluation across the entire RAG pipeline, including chunking, retrieval, reranking, and generation. To effectively evaluate the first three phases, we introduce multi-granularity keywords, including coarse-grained and fine-grained keywords, to assess the retrieved context instead of relying on the annotation of golden chunks. Moreover, we release a holistic benchmark dataset tailored for diverse data scenarios covering a wide range of document formats and query types. We demonstrate the utility of the CoFE-RAG framework by conducting experiments to evaluate each stage of RAG systems. Our evaluation method provides unique insights into the effectiveness of RAG systems in handling diverse data scenarios, offering a more nuanced understanding of their capabilities and limitations.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses three key issues in the evaluation of existing Retrieval-Augmented Generation (RAG) systems:

1. **Limited Data Diversity**: Current evaluation methods rely mainly on well-formatted plain-text knowledge sources crawled from HTML and lack support for complex documents (e.g., PDFs). They also focus primarily on simple queries (usually factual ones), so they say little about how a system handles more complex analytical or tutorial queries.
2. **Unclear Problem Localization**: Most existing evaluation methods focus on end-to-end results without stage-by-stage analysis. The RAG process can be divided into several stages: chunking, retrieval, reranking, and generation. Evaluating only the final generated output makes it difficult to identify which stage caused a problem, which lowers interpretability and slows optimization.
3. **Unstable Retrieval Evaluation**: Current evaluation methods often rely on annotated "golden chunks" to assess retrieval performance. These annotations are tied to one particular chunking of the documents, so the assessment becomes unreliable when the chunking strategy changes, and re-annotating all chunks is tedious and labor-intensive.

To address these issues, the authors propose a Comprehensive Full-chain Evaluation framework (CoFE-RAG) to facilitate thorough evaluation of the entire RAG process, including chunking, retrieval, reranking, and generation. To evaluate the first three stages, the authors introduce multi-granularity keywords (coarse-grained and fine-grained) to assess the retrieved context instead of relying on annotated "golden chunks." Additionally, the authors release a benchmark dataset covering various data scenarios, including different document formats and query types.

By experimentally evaluating each stage of the RAG system, the authors demonstrate the practicality of the CoFE-RAG framework and provide an in-depth understanding of the effectiveness and limitations of RAG systems in handling diverse data scenarios.
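
To make the keyword idea concrete, here is a minimal Python sketch of why keyword-based scoring survives a change of chunking strategy while golden-chunk labels do not. It is an illustration under stated assumptions, not the authors' implementation: the function names `chunk_fixed` and `keyword_recall`, the case-insensitive substring match, the toy document, and the "take the first k chunks" stand-in for a retriever are all hypothetical. The paper's actual metric definitions may differ.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` words (toy chunker)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def keyword_recall(retrieved: list[str], keywords: list[str]) -> float:
    """Fraction of keywords found in at least one retrieved chunk.

    Assumption: a keyword counts as found if it occurs as a
    case-insensitive substring of a single chunk.
    """
    if not keywords:
        return 0.0
    lowered = [chunk.lower() for chunk in retrieved]
    hits = sum(
        1 for kw in keywords
        if any(kw.lower() in chunk for chunk in lowered)
    )
    return hits / len(keywords)


# Toy document and annotated keywords (both invented for this example).
doc = (
    "The framework scores chunking retrieval and reranking with keywords. "
    "Generation is scored separately against reference answers."
)
keywords = ["chunking", "reranking", "reference answers"]

# Two chunking strategies produce different chunk boundaries and IDs.
small_chunks = chunk_fixed(doc, 4)   # 4 chunks of 4 words
large_chunks = chunk_fixed(doc, 8)   # 2 chunks of 8 words

# Stand-in for a retriever: take the leading chunks covering the same text.
score_small = keyword_recall(small_chunks[:2], keywords)
score_large = keyword_recall(large_chunks[:1], keywords)
print(score_small, score_large)
```

Because the score depends only on the text of the retrieved context, the same keyword annotations can be reused for both chunkings; a golden-chunk label such as "chunk 2 is relevant" would have to be redone once the chunk boundaries move.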

CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity

RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models

Evaluation of Retrieval-Augmented Generation: A Survey

RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation

CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation

DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation

FunnelRAG: A Coarse-to-Fine Progressive Retrieval Paradigm for RAG

Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

LightRAG: Simple and Fast Retrieval-Augmented Generation

RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

Fine-Grained Guidance for Retrievers: Leveraging LLMs' Feedback in Retrieval-Augmented Generation

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation

FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation

Retrieval-Augmented Generation for AI-Generated Content: A Survey

Corrective Retrieval Augmented Generation