ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Jon Saad-Falcon, Omar Khattab, Christopher Potts, Matei Zaharia
2024-04-01
Abstract: Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across eight different knowledge-intensive tasks in KILT, SuperGLUE, and AIS, ARES accurately evaluates RAG systems while using only a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our code and datasets publicly available on GitHub.
Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses shortcomings in how Retrieval-Augmented Generation (RAG) systems are evaluated. Traditional evaluation relies on large amounts of manually annotated data covering input queries, passages to retrieve, and responses to generate. Producing these annotations is time-consuming, resource-intensive, and requires considerable expertise. Existing automated frameworks such as RAGAS simplify parts of the workflow, but their evaluation strategies are inflexible, which makes them hard to adapt to different evaluation settings and leaves the quality of the evaluation without guarantees.
To solve these problems, the authors propose ARES (Automated RAG Evaluation System). ARES improves on existing methods in the following ways:
1. **Data Efficiency**: Requires only a small number (approximately 150) of manually annotated data points to complete the evaluation.
2. **Automated Data Generation**: Creates its own synthetic training data and uses it to fine-tune lightweight language-model judges that assess the individual components of the RAG system.
3. **Statistical Confidence Intervals**: Uses prediction-powered inference (PPI), which combines the judges' predictions with the small human-annotated set, to mitigate judge prediction errors and attach statistical confidence intervals to the scores.
4. **Multidimensional Evaluation**: Evaluates the RAG system along three dimensions: context relevance, answer faithfulness, and answer relevance.
Experimental results show that ARES significantly outperforms existing methods on knowledge-intensive tasks drawn from KILT, SuperGLUE, and AIS, improving evaluation accuracy for context relevance and answer relevance by 59.3% and 14.4%, respectively. ARES also identifies hallucinated answers effectively and reduces the amount of manual annotation required by 78% compared to traditional annotation-based methods.
Overall, ARES provides a new solution for the rapid and accurate evaluation of RAG systems.
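
To make the role of PPI more concrete, the following is a minimal sketch of the standard prediction-powered estimator for a mean, applied to a judge's binary scores. It is not taken from the ARES codebase: the function name `ppi_mean_ci`, its parameters, and the example inputs are illustrative assumptions, and ARES's own implementation may differ in its details. The idea is to start from the judge's average score on the large unlabeled set and subtract the judge's average error as measured on the small human-annotated set.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(human_labels, judge_on_labeled, judge_on_unlabeled, alpha=0.05):
    """Prediction-powered estimate and confidence interval for a mean score.

    Hypothetical helper for illustration (not the ARES API).
    human_labels:       human annotations (0/1) on the small labeled set
    judge_on_labeled:   judge predictions on the same labeled examples
    judge_on_unlabeled: judge predictions on the large unlabeled set
    """
    n, big_n = len(human_labels), len(judge_on_unlabeled)
    if n != len(judge_on_labeled):
        raise ValueError("labeled inputs must have the same length")
    if n < 2 or big_n < 2:
        raise ValueError("need at least two examples in each set")

    # Judge error on each labeled example (the "rectifier" terms).
    errors = [f - y for f, y in zip(judge_on_labeled, human_labels)]

    # Judge's mean score on unlabeled data, corrected by its mean error.
    estimate = fmean(judge_on_unlabeled) - fmean(errors)

    # The two samples are independent, so their variances add.
    std_err = sqrt(variance(judge_on_unlabeled) / big_n + variance(errors) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    # Made-up toy inputs, for illustration only.
    humans = [1, 0, 1, 1]
    judge_labeled = [1, 1, 1, 1]
    judge_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1]
    print(ppi_mean_ci(humans, judge_labeled, judge_unlabeled))
```

The interval narrows as the judge agrees more closely with the human labels, since the variance of the error terms shrinks. This is one way to see how a small annotated set can support an evaluation over a much larger pool of judged examples.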