Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage

Kaige Xie,Philippe Laban,Prafulla Kumar Choubey,Caiming Xiong,Chien-Sheng Wu
2024-10-21
Abstract: Evaluating retrieval-augmented generation (RAG) systems remains challenging, particularly for open-ended questions that lack definitive answers and require coverage of multiple sub-topics. In this paper, we introduce a novel evaluation framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question. We propose decomposing questions into sub-questions and classifying them into three types -- core, background, and follow-up -- to reflect their roles and importance. Using this categorization, we introduce a fine-grained evaluation protocol that provides insights into the retrieval and generation characteristics of RAG systems, including three commercial generative answer engines: You.com, Perplexity AI, and Bing Chat. Interestingly, we find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions, revealing clear opportunities for improvement. Further, sub-question coverage metrics prove effective for ranking responses, achieving 82% accuracy compared to human preference annotations. Lastly, we also demonstrate that leveraging core sub-questions enhances both retrieval and answer generation in a RAG system, resulting in a 74% win rate over the baseline that lacks sub-questions.
Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to evaluate and optimize Retrieval-Augmented Generation (RAG) systems on open-ended questions, especially those that lack definitive answers and need to cover multiple sub-topics. Specifically, the paper introduces a new evaluation framework based on sub-question coverage, which measures how well a RAG system handles the different aspects of a complex question. By decomposing a question into sub-questions and classifying them as core, background, or follow-up, the paper proposes a fine-grained evaluation protocol for analyzing the retrieval and generation characteristics of RAG systems. The paper also explores using core sub-questions to improve retrieval and generation, and thereby the quality and comprehensiveness of responses.

### Main contributions of the paper:

1. **Introducing sub-question coverage as an evaluation metric**: The paper proposes a new evaluation framework that measures how well a RAG system answers complex questions through sub-question coverage. This method considers not only whether the generated answer is accurate, but also whether it covers all important aspects of the question.
2. **Classification and decomposition of sub-questions**: The paper divides sub-questions into three categories (core, background, and follow-up) and defines each type and its role in answering the main question. This classification allows a more fine-grained evaluation of RAG systems.
3. **Fine-grained evaluation protocol**: Based on sub-question coverage, the paper designs a series of evaluation metrics, including sub-question coverage of answers, sub-question coverage of retrieval, the ability to identify core knowledge from retrieval to generation, and the potential for performance gains through improved retrieval.
4. **Practical application and verification**: The paper evaluates three popular RAG systems (You.com, Perplexity AI, and Bing Chat), shows how their performance differs across sub-question types, and puts forward suggestions for improvement.
5. **Automatic answer quality evaluation**: The paper proposes an automatic answer quality evaluation method based on sub-question coverage and verifies that it agrees strongly with human preferences, with an accuracy of 82%.

### Key findings of the paper:

- **Importance of core sub-questions**: All evaluated RAG systems cover core sub-questions more often than the other types, but still miss about 50% of them, indicating considerable room for improvement.
- **Disconnection between retrieval and generation**: Even when the relevant core information is retrieved, the RAG system often fails to integrate it into the final answer, which reveals limitations both in the completeness of retrieval and in how generation uses the retrieved content.
- **Effectiveness of automatic evaluation**: The automatic evaluation method based on sub-question coverage approximates human perception of answer quality well, especially with respect to coverage of core sub-questions, and its accuracy is significantly higher than that of the traditional LLM-as-a-judge method.
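The sub-question coverage idea described above can be illustrated with a minimal Python sketch. It assumes that each sub-question has already been typed (core, background, or follow-up) and judged as covered or not covered by an answer; how those judgments are produced is outside this sketch. The names `SubQuestion`, `coverage_by_type`, and `prefer` are hypothetical, and the core-first comparison rule in `prefer` is an illustrative assumption, not the paper's actual ranking method.

```python
from collections import defaultdict
from dataclasses import dataclass

# Sub-question types named in the paper, listed in the priority order
# that this sketch assumes for comparing two answers.
TYPES = ("core", "background", "follow-up")


@dataclass(frozen=True)
class SubQuestion:
    text: str
    kind: str      # one of TYPES
    covered: bool  # hypothetical judgment: does the answer address it?


def coverage_by_type(sub_questions):
    """Return the fraction of covered sub-questions for each type present."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sq in sub_questions:
        if sq.kind not in TYPES:
            raise ValueError(f"unknown sub-question type: {sq.kind!r}")
        totals[sq.kind] += 1
        hits[sq.kind] += int(sq.covered)
    return {kind: hits[kind] / totals[kind] for kind in TYPES if totals[kind]}


def prefer(cov_a, cov_b):
    """Return 'A', 'B', or 'tie', comparing core coverage first (assumed rule)."""
    for kind in TYPES:
        a, b = cov_a.get(kind, 0.0), cov_b.get(kind, 0.0)
        if a != b:
            return "A" if a > b else "B"
    return "tie"


# Made-up judgments for two answers to the same question.
answer_a = [
    SubQuestion("q1", "core", True),
    SubQuestion("q2", "core", False),
    SubQuestion("q3", "background", True),
    SubQuestion("q4", "follow-up", False),
]
answer_b = [
    SubQuestion("q1", "core", True),
    SubQuestion("q2", "core", True),
    SubQuestion("q3", "background", False),
    SubQuestion("q4", "follow-up", False),
]

cov_a = coverage_by_type(answer_a)
cov_b = coverage_by_type(answer_b)
print(cov_a, cov_b, prefer(cov_a, cov_b))
```

In this toy example, answer B is preferred because it covers both core sub-questions, even though answer A covers more background material. A weighted sum over the three coverage values would be another reasonable way to combine them.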
In conclusion, by introducing the concept of sub-question coverage, this paper provides a new, fine-grained evaluation framework that not only helps evaluate RAG systems more comprehensively, but also offers concrete guidance for improving them.