RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, Maosong Sun
2024-10-17
Abstract: Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications.
Computation and Language, Information Retrieval
What problem does this paper attempt to address?
The paper addresses the difficulty of evaluating Retrieval-Augmented Generation (RAG) systems in specialized scenarios. Specifically, it identifies three issues:
1. **High cost of data construction**: Building high-quality evaluation datasets for a specific scenario requires significant manpower and time, especially when specialized knowledge is needed.
2. **Lack of suitable evaluation metrics**: Existing metrics often fail to measure RAG performance comprehensively and accurately, particularly when domain-specific knowledge or factual accuracy is required.
3. **Limited coverage**: Existing RAG benchmarks cover only a limited range of scenarios, making it difficult to meet the needs of practical applications.
To address these issues, the paper proposes RAGEval, a framework that automatically generates high-quality documents, questions, answers, and references through a schema-based pipeline, so that RAG systems can be evaluated in varied scenarios. The paper also introduces three new metrics (Completeness, Hallucination, and Irrelevance) that focus on the factual accuracy of generated answers. Experiments show that RAGEval outperforms zero-shot and one-shot generation in the clarity, safety, conformity, and richness of the generated samples, and that LLM-based scoring of the proposed metrics is highly consistent with human evaluation.
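The summary names the three metrics but does not say how they are computed. As a rough illustration only, the sketch below assumes a key-point scheme: a reference answer is split into key points, and a judge (human or LLM) labels each one as covered by the generated answer, contradicted by it, or neither. Each metric is then the fraction of key points in the corresponding class. The label names and the `keypoint_scores` function are hypothetical and are not taken from the paper's code.

```python
from collections import Counter

# Hypothetical judgment labels, one per ground-truth key point (assumption).
COVERED = "covered"            # the answer states the key point correctly
CONTRADICTED = "contradicted"  # the answer conflicts with the key point
MISSING = "missing"            # the answer neither covers nor contradicts it

VALID_LABELS = {COVERED, CONTRADICTED, MISSING}


def keypoint_scores(labels):
    """Turn per-key-point judgments into three fractions that sum to 1."""
    if not labels:
        raise ValueError("at least one key point is required")
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(labels)
    return {
        "completeness": counts[COVERED] / total,
        "hallucination": counts[CONTRADICTED] / total,
        "irrelevance": counts[MISSING] / total,
    }


# Example: four key points, two covered, one contradicted, one missing.
scores = keypoint_scores([COVERED, COVERED, CONTRADICTED, MISSING])
print(scores)
```

Under this assumed scheme the three scores always sum to 1, so a drop in Completeness can be traced either to contradicted facts or to omitted ones. The labeling step itself, which the summary says can be done by an LLM in close agreement with human raters, is left outside the sketch.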