Abstract: Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across eight different knowledge-intensive tasks in KILT, SuperGLUE, and AIS, ARES accurately evaluates RAG systems while using only a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our code and datasets publicly available on GitHub.

What problem does this paper attempt to address?

The paper addresses the cost and rigidity of evaluating Retrieval-Augmented Generation (RAG) systems. Traditional evaluation relies on large amounts of manually annotated data covering input queries, retrieved passages, and generated responses. Producing these annotations is time-consuming, resource-intensive, and requires considerable expertise. Existing automated evaluation frameworks such as RAGAS simplify parts of the workflow, but their evaluation strategies are inflexible, which makes them hard to adapt to different evaluation settings, and the quality of their evaluations cannot be guaranteed.

To solve these problems, the authors propose an automated evaluation system named ARES (Automated RAG Evaluation System). ARES improves on existing methods in the following ways:

1. **Data Efficiency**: Requires only a small number (approximately 150) of manually annotated datapoints to complete the evaluation.
2. **Automated Data Generation**: Creates its own synthetic training data and uses it to fine-tune lightweight LM judges that assess individual components of the RAG system.
3. **Statistical Confidence Intervals**: Employs Prediction-Powered Inference (PPI), which combines the judges' predictions with the small human-annotated set to mitigate prediction errors and attach statistical confidence intervals to the scores.
4. **Multidimensional Evaluation**: Evaluates the RAG system along three dimensions: context relevance, answer faithfulness, and answer relevance.

Experimental results show that ARES significantly outperforms existing methods on multiple knowledge-intensive task datasets, improving evaluation accuracy in context relevance and answer relevance by 59.3% and 14.4%, respectively. Moreover, ARES can effectively identify hallucinations in answers and reduces the amount of manual annotation required by 78% compared to traditional annotation-based methods.
Overall, ARES provides a new solution for the rapid and accurate evaluation of RAG systems.
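
To make the PPI step more concrete, the following is a minimal sketch of the standard prediction-powered estimate of a mean with a normal-approximation confidence interval. It is not taken from the ARES codebase: the function name `ppi_mean_ci` and its parameters are hypothetical, and the sketch assumes binary (0/1) judge predictions and human labels for a single dimension such as context relevance. The idea is to average the judge's predictions over the large unlabeled set, then correct that average by the judge's mean error measured on the small human-annotated set.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(judge_unlabeled, judge_labeled, human_labeled, alpha=0.05):
    """Prediction-powered estimate of a mean score with a (1 - alpha) CI.

    judge_unlabeled: judge predictions on examples without human labels
    judge_labeled:   judge predictions on the human-annotated examples
    human_labeled:   human labels for those same annotated examples
    All names here are illustrative assumptions, not the ARES API.
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("judge_labeled and human_labeled must align")

    n_unlabeled = len(judge_unlabeled)
    n_labeled = len(human_labeled)

    # Rectifier: how far the judge is from the human label on each
    # annotated example. Its mean estimates the judge's bias.
    rectifier = [y - f for y, f in zip(human_labeled, judge_labeled)]

    # Judge average on unlabeled data, corrected by the estimated bias.
    estimate = fmean(judge_unlabeled) + fmean(rectifier)

    # The two terms come from independent samples, so variances add.
    std_err = math.sqrt(
        variance(judge_unlabeled) / n_unlabeled
        + variance(rectifier) / n_labeled
    )

    # Two-sided normal quantile, about 1.96 for alpha = 0.05.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)
```

Calling `ppi_mean_ci(judge_scores, judge_on_annotated, human_labels)` returns the corrected score and its interval. The interval is not clipped to [0, 1], and it narrows as either the unlabeled set or the annotated set grows, which is consistent with the summary's point that a small annotated set can be enough when the judge is reasonably accurate.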

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Evaluation of Retrieval-Augmented Generation: A Survey

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

DIRAS: Efficient LLM Annotation of Document Relevance in Retrieval Augmented Generation

RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems

Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems

REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering

EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations

AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline

CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity

Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

Retrieval Augmented Generation Systems: Automatic Dataset Creation, Evaluation and Boolean Agent Setup

Evaluating Retrieval Quality in Retrieval-Augmented Generation

Optimizing and Evaluating Enterprise Retrieval-Augmented Generation (RAG): A Content Design Perspective

RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Generation