Abstract: Evaluating retrieval-augmented generation (RAG) systems remains challenging, particularly for open-ended questions that lack definitive answers and require coverage of multiple sub-topics. In this paper, we introduce a novel evaluation framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question. We propose decomposing questions into sub-questions and classifying them into three types -- core, background, and follow-up -- to reflect their roles and importance. Using this categorization, we introduce a fine-grained evaluation protocol that provides insights into the retrieval and generation characteristics of RAG systems, including three commercial generative answer engines: You.com, Perplexity AI, and Bing Chat. Interestingly, we find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions, revealing clear opportunities for improvement. Further, sub-question coverage metrics prove effective for ranking responses, achieving 82% accuracy compared to human preference annotations. Lastly, we also demonstrate that leveraging core sub-questions enhances both retrieval and answer generation in a RAG system, resulting in a 74% win rate over the baseline that lacks sub-questions.
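To make the coverage idea concrete, here is a minimal Python sketch of per-type sub-question coverage. It is an illustration under stated assumptions, not the paper's implementation: the `SubQuestion` class, the `coverage_by_type` function, and the boolean `covered` field are hypothetical names, and the judgment of whether an answer addresses a sub-question is assumed to be supplied externally (for example by an annotator or a judge model).

```python
from collections import defaultdict
from dataclasses import dataclass

# The three sub-question types named in the abstract.
TYPES = ("core", "background", "follow-up")


@dataclass(frozen=True)
class SubQuestion:
    text: str
    kind: str      # one of TYPES
    covered: bool  # assumed external judgment: does the answer address it?


def coverage_by_type(sub_questions):
    """Return the fraction of covered sub-questions for each type present."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sq in sub_questions:
        if sq.kind not in TYPES:
            raise ValueError(f"unknown sub-question type: {sq.kind!r}")
        totals[sq.kind] += 1
        hits[sq.kind] += sq.covered  # True counts as 1, False as 0
    # Types with no sub-questions are omitted rather than reported as 0.
    return {kind: hits[kind] / totals[kind] for kind in TYPES if totals[kind]}


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example = [
        SubQuestion("What does the policy change?", "core", True),
        SubQuestion("Who is affected by it?", "core", False),
        SubQuestion("How did the policy originate?", "background", False),
        SubQuestion("What reforms are proposed next?", "follow-up", True),
    ]
    print(coverage_by_type(example))
    # {'core': 0.5, 'background': 0.0, 'follow-up': 1.0}
```

Reporting coverage separately per type, rather than as one pooled number, is what allows a statement such as "around 50% of core sub-questions are missed" to be read directly from the output.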

What problem does this paper attempt to address?

The paper addresses how to evaluate and optimize Retrieval-Augmented Generation (RAG) systems on open-ended questions, especially those that lack clear answers and span multiple sub-topics. It introduces an evaluation framework based on sub-question coverage, which measures how well a RAG system handles the different aspects of a complex question. Questions are decomposed into sub-questions, each classified as core, background, or follow-up, and a fine-grained evaluation protocol built on this classification is used to analyze the retrieval and generation characteristics of RAG systems. The paper also explores using core sub-questions to improve retrieval and generation, with the goal of producing higher-quality and more comprehensive responses.

### Main contributions of the paper:

1. **Introducing sub-question coverage as an evaluation metric**: The paper proposes an evaluation framework that measures how well a RAG system answers complex questions through sub-question coverage. The method considers not only whether the generated answer is accurate, but also whether it covers all important aspects of the question.
2. **Classification and decomposition of sub-questions**: The paper divides sub-questions into three categories (core, background, and follow-up) and defines each type and its role in answering the main question. This classification supports a more detailed evaluation of RAG systems.
3. **Fine-grained evaluation protocol**: Based on sub-question coverage, the paper designs a set of evaluation metrics: sub-question coverage of answers, sub-question coverage of retrieval, the ability to carry core knowledge from retrieval through to generation, and the potential performance gain from improved retrieval.
4. **Practical application and verification**: The paper evaluates three popular RAG systems (You.com, Perplexity AI and Bing Chat), shows how they differ across sub-question types, and suggests improvements.
5. **Automatic answer quality evaluation**: The paper proposes an automatic answer quality evaluation method based on sub-question coverage and reports that it agrees with human preference annotations with 82% accuracy.

### Key findings of the paper:

- **Importance of core sub-questions**: All evaluated RAG systems cover core sub-questions more often than the other types, but still miss about 50% of them, which leaves substantial room for improvement.
- **Disconnection between retrieval and generation**: Even when relevant core information is retrieved, the RAG system often fails to integrate it into the final answer. This exposes limitations both in how complete retrieval is and in how well the generator uses what was retrieved.
- **Effectiveness of automatic evaluation**: The automatic evaluation method based on sub-question coverage approximates human judgments of answer quality well, especially with respect to core sub-question coverage, and its accuracy is significantly higher than that of the traditional LLM-as-a-judge method.

In conclusion, by introducing sub-question coverage, the paper provides a fine-grained evaluation framework that supports a more comprehensive assessment of RAG systems and gives concrete guidance for improving them.
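The automatic evaluation described above, which ranks responses by coverage and checks agreement with human preferences, can be sketched as follows. This is a hedged illustration only: the weighted-sum scoring rule, the weight values, and the names `coverage_score` and `pairwise_accuracy` are assumptions made for this example, and the summary does not state how the paper combines coverage across sub-question types.

```python
# Hypothetical weights: core coverage counts most. These values are
# placeholders for illustration, not taken from the paper.
DEFAULT_WEIGHTS = {"core": 1.0, "background": 0.5, "follow-up": 0.25}


def coverage_score(coverage, weights=None):
    """Combine per-type coverage fractions into one score (assumed rule)."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    return sum(weights.get(kind, 0.0) * frac for kind, frac in coverage.items())


def pairwise_accuracy(pairs, weights=None):
    """Fraction of response pairs where the coverage ranking matches humans.

    Each item is (coverage_a, coverage_b, human_choice), where the coverage
    values map type -> fraction and human_choice is "a" or "b".
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one annotated pair")
    correct = 0
    for cov_a, cov_b, human_choice in pairs:
        score_a = coverage_score(cov_a, weights)
        score_b = coverage_score(cov_b, weights)
        if score_a == score_b:
            continue  # a tie makes no prediction, so it counts as a miss
        predicted = "a" if score_a > score_b else "b"
        correct += predicted == human_choice
    return correct / len(pairs)


if __name__ == "__main__":
    # Made-up annotations, for illustration only.
    annotated = [
        ({"core": 0.8, "background": 0.2}, {"core": 0.4, "background": 0.6}, "a"),
        ({"core": 0.5}, {"core": 0.5, "follow-up": 1.0}, "b"),
        ({"core": 0.6}, {"core": 0.3}, "b"),
        ({"core": 0.5}, {"core": 0.5}, "a"),
    ]
    print(pairwise_accuracy(annotated))  # 0.5
```

Counting ties as misses is a conservative choice; an evaluation that instead skips tied pairs would report accuracy over a smaller denominator.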

Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage
