RAGProbe: An Automated Approach for Evaluating RAG Applications

Shangeetha Sivasothy, Scott Barnett, Stefanus Kurniawan, Zafaryab Rasool, Rajesh Vasa

2024-09-25

Abstract: Retrieval Augmented Generation (RAG) is increasingly being used when building Generative AI applications. Evaluating these applications and RAG pipelines is mostly done manually, via a trial-and-error process. Automating evaluation of RAG pipelines requires overcoming challenges such as context misunderstanding, wrong format, incorrect specificity, and missing content. Prior works therefore focused on improving evaluation metrics as well as enhancing components within the pipeline using available question and answer datasets. However, they have not focused on 1) providing a schema for capturing different types of question-answer pairs or 2) creating a set of templates for generating question-answer pairs that can support automation of RAG pipeline evaluation. In this paper, we present a technique for generating variations in question-answer pairs to trigger failures in RAG pipelines. We validate 5 open-source RAG pipelines using 3 datasets. Our approach revealed the highest failure rates when prompts combine multiple questions: 91% for questions spanning multiple documents and 78% for questions from a single document, indicating a need for developers to prioritise handling these combined questions. A 60% failure rate was observed on the academic-domain dataset, and failure rates of 53% and 62% were observed on the open-domain datasets. Our automated approach outperforms the existing state-of-the-art methods, increasing the failure rate by 51% on average per dataset. Our work presents an automated approach for continuously monitoring the health of RAG pipelines, which can be integrated into existing CI/CD pipelines, allowing for improved quality.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper addresses the challenge of automatically evaluating Retrieval Augmented Generation (RAG) pipelines. Most existing evaluations of RAG applications and pipelines rely on a manual trial-and-error process, which is both time-consuming and error-prone. The paper points out that automating the evaluation of RAG pipelines requires overcoming challenges such as context misunderstanding, wrong format, incorrect specificity, and missing content. Previous work has focused on improving evaluation metrics and enhancing components in the pipeline, but not on 1) providing a schema for capturing different types of question-answer pairs, or 2) creating a set of templates for generating question-answer pairs to support the automation of RAG pipeline evaluation. The paper fills this gap by proposing a technique that automatically generates variations of question-answer pairs to trigger failures in a RAG pipeline. The authors validate the technique on five open-source RAG pipelines using three datasets. The highest failure rates occur when prompts combine multiple questions: 91% for questions spanning multiple documents and 78% for questions from a single document. The paper also reports that the approach outperforms existing state-of-the-art methods, increasing the failure rate by 51% on average per dataset. In summary, the goal of the paper is an automated approach that can continuously monitor the health of a RAG pipeline and be integrated into existing CI/CD pipelines to improve quality.
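To make the idea of a question-answer schema and a per-scenario failure rate concrete, here is a minimal Python sketch. It is not the paper's implementation: the `QAPair` fields, the scenario names, the `failure_rates` function, and the naive substring judge `contains_expected` are all assumptions introduced for illustration, and the pipeline is treated as any callable that maps a question to an answer string.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class QAPair:
    """One generated test case (hypothetical schema, not the paper's)."""
    scenario: str                 # e.g. "single_doc" or "multi_doc" (assumed labels)
    question: str
    expected_answer: str
    source_docs: Tuple[str, ...]  # identifiers of the documents the answer draws on


def contains_expected(answer: str, expected: str) -> bool:
    """Naive judge (assumption): case-insensitive substring match."""
    return expected.lower() in answer.lower()


def failure_rates(
    pipeline: Callable[[str], str],
    pairs: Iterable[QAPair],
    is_correct: Callable[[str, str], bool] = contains_expected,
) -> Dict[str, float]:
    """Return the fraction of failed questions for each scenario."""
    total: Dict[str, int] = defaultdict(int)
    failed: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        total[pair.scenario] += 1
        answer = pipeline(pair.question)
        if not is_correct(answer, pair.expected_answer):
            failed[pair.scenario] += 1
    # Every scenario in `total` has at least one pair, so no division by zero.
    return {scenario: failed[scenario] / total[scenario] for scenario in total}
```

In a CI/CD setting, a job could call `failure_rates` on a fixed set of generated pairs and fail the build when any scenario exceeds a chosen threshold; the threshold and the choice of judge are left to the developer.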

RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Seven Failure Points When Engineering a Retrieval Augmented Generation System

A Methodology for Evaluating RAG Systems: A Case Study On Configuration Dependency Validation

CRAG -- Comprehensive RAG Benchmark

Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Towards Understanding Retrieval Accuracy and Prompt Quality in RAG Systems

Optimizing and Evaluating Enterprise Retrieval-Augmented Generation (RAG): A Content Design Perspective

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

InspectorRAGet: An Introspection Platform for RAG Evaluation

Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage

Evaluation of RAG Metrics for Question Answering in the Telecom Domain

Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline

A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions

Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems

Evaluation of Retrieval-Augmented Generation: A Survey

RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems