RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, Maosong Sun

2024-10-17

Abstract: Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications.

Computation and Language, Information Retrieval

What problem does this paper attempt to address?

The paper addresses the difficulty of evaluating Retrieval-Augmented Generation (RAG) systems in specific scenarios. It identifies three issues:

1. **High cost of data construction**: Building high-quality evaluation datasets for a specific scenario requires significant manpower and time, especially when specialized knowledge is needed.
2. **Lack of suitable evaluation metrics**: Existing metrics often fail to measure RAG performance comprehensively and accurately, particularly when domain-specific knowledge or factual accuracy is required.
3. **Limited coverage**: Existing RAG benchmarks cover only a limited range of scenarios, making it difficult to meet the needs of practical applications.

To address these issues, the paper proposes RAGEval, a framework that automatically generates high-quality documents, questions, answers, and references for evaluating RAG systems in various scenarios. The paper also introduces three new evaluation metrics (Completeness, Hallucination, and Irrelevance) to assess the answers generated by RAG systems more rigorously. Experiments show that RAGEval outperforms zero-shot and one-shot methods in the clarity, safety, conformity, and richness of the generated samples, and that LLM-based scoring of the proposed metrics is highly consistent with human evaluation.
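As a rough illustration of how the three metrics relate to one another, the sketch below scores a single answer against a list of ground-truth key points. This summary does not define the metrics, so the key-point formulation here is an assumption: each key point is taken to be labeled by a judge (an LLM or a human) as covered, contradicted, or unaddressed by the answer, and each metric is the fraction of key points with that label. The label names and the `score_answer` function are hypothetical and are not part of RAGEval itself.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical label names; one label is assigned to each ground-truth key point.
LABELS = ("covered", "contradicted", "unaddressed")


def score_answer(labels: List[str]) -> Dict[str, float]:
    """Turn per-key-point labels into the three scores.

    Assumption: each score is the fraction of key points carrying the
    matching label, so the three scores always sum to 1.
    """
    if not labels:
        raise ValueError("at least one key point is required")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    counts = Counter(labels)
    total = len(labels)
    return {
        "completeness": counts["covered"] / total,
        "hallucination": counts["contradicted"] / total,
        "irrelevance": counts["unaddressed"] / total,
    }


# Example: four key points, two covered, one contradicted, one not addressed.
scores = score_answer(["covered", "covered", "contradicted", "unaddressed"])
print(scores)
```

Under this assumption a higher Completeness is better, while higher Hallucination and Irrelevance indicate contradicted and missing key points respectively. The labeling step itself, which the abstract says can be done by an LLM with high agreement with human judgment, is deliberately left outside the sketch.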

