Abstract: Retrieval Augmented Generation (RAG) has become a popular application for large language models. Successful RAG systems should provide accurate answers that are grounded in a passage and free of hallucinations. While considerable work is required to build a full RAG pipeline, being able to benchmark its performance is also necessary. We present ClapNQ, a benchmark Long-form Question Answering dataset for the full RAG pipeline. ClapNQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The ClapNQ answers are concise, 3x smaller than the full passage, and cohesive, combining multiple pieces of the passage that are not contiguous. RAG models must adapt to these properties to be successful at ClapNQ. We present baseline experiments and analysis for ClapNQ that highlight areas where there is still significant room for improvement in grounded RAG. CLAPNQ is publicly available at

What problem does this paper attempt to address?

The problem this paper attempts to address is how to evaluate and improve the Long-form Question Answering (LFQA) capabilities of large language models (LLMs) in Retrieval Augmented Generation (RAG) systems. Specifically, the paper focuses on ensuring that generated answers are faithful to the supporting documents, concise, complete, and coherent, while also identifying unanswerable questions.

### Main Issues:

1. **Faithfulness**: The answer needs to be based on the supporting documents so that users can trust it.
2. **Conciseness**: The answer should include all the information needed to answer the question but exclude irrelevant information.
3. **Completeness**: For questions requiring rich information, clear explanations, or detailed descriptions, the answer must include all necessary information.
4. **Coherence**: The answer should be extracted from multiple non-contiguous text fragments and combined into a complete response.
5. **Unanswerable Questions**: The system needs to identify and correctly handle questions that cannot be answered.

### Background and Motivation:

- **Insufficiency of Existing Datasets**: Existing long-form question answering datasets such as ELI5 and AquaMuse either lack supporting documents, do not include unanswerable questions, or have answers that are not coherent and complete.
- **Challenges of RAG Systems**: RAG systems need to perform well in both the retrieval and generation stages, but existing benchmarks often focus on only one of them.

### Solution:

- **CLAPNQ Dataset**: The paper proposes a new benchmark dataset, CLAPNQ, specifically designed to evaluate the long-form question answering capabilities of RAG systems. This dataset has the following features:
  - **Faithfulness**: Each question has a supporting gold passage.
  - **Conciseness**: Answers are 3 times shorter than the full passage.
  - **Completeness**: Answers contain all necessary information.
  - **Coherence**: Answers are composed of multiple non-contiguous text fragments.
  - **Unanswerable Questions**: The dataset includes a portion of unanswerable questions to simulate real-world scenarios.

### Experiments and Results:

- **Baseline Experiments**: The paper conducts multiple baseline experiments covering retrieval, generation, and the complete RAG pipeline, showing the performance of current state-of-the-art models on the CLAPNQ dataset.
- **Human Evaluation**: Human evaluation further validates model performance and highlights areas for improvement.

### Main Contributions:

1. **Creation of the CLAPNQ Dataset**: This dataset features non-contiguous relevant fragments, testing the LLM's ability to extract relevant parts while maintaining faithfulness and conciseness.
2. **Baseline Experiments**: Provides baseline results for the latest models in retrieval, generation, and the complete RAG pipeline.
3. **Human Evaluation and Discussion**: Points out areas for improvement through human evaluation.

Overall, this paper aims to advance research on and application of RAG systems in long-form question answering through the CLAPNQ dataset, particularly in terms of faithfulness, conciseness, completeness, and coherence.
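
The conciseness and coherence properties described above can be made concrete with a small sketch. It assumes an answer is built from whole passage sentences copied verbatim, which is a simplification for illustration. The function names, the naive sentence splitter, and the example passage are all hypothetical. This is not the paper's evaluation method or the dataset's actual format.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def compression_ratio(passage, answer):
    """Passage length divided by answer length, in whitespace tokens."""
    return len(passage.split()) / max(len(answer.split()), 1)


def selected_sentence_indices(passage, answer):
    """Indices of passage sentences that appear verbatim in the answer."""
    return [i for i, s in enumerate(split_sentences(passage)) if s in answer]


def is_non_contiguous(indices):
    """True if at least one passage sentence is skipped between selected ones."""
    return any(b - a > 1 for a, b in zip(indices, indices[1:]))


# Made-up example data (not taken from the dataset).
passage = (
    "The heart pumps blood through the body. "
    "It is located in the chest. "
    "It has four chambers. "
    "Many animals have hearts. "
    "The heart beats about 100,000 times a day. "
    "Doctors study it closely."
)
answer = "The heart pumps blood through the body. It has four chambers."

indices = selected_sentence_indices(passage, answer)
ratio = compression_ratio(passage, answer)
print(indices, is_non_contiguous(indices), ratio)
```

In this toy case the answer uses the first and third passage sentences, skipping the one in between, and is one third the length of the passage, which mirrors the "3x smaller" and "non-contiguous" properties the summary attributes to the dataset.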

CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems

CRAG -- Comprehensive RAG Benchmark

Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall

RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

QPaug: Question and Passage Augmentation for Open-Domain Question Answering of LLMs

RAG based Question-Answering for Contextual Response Prediction System

RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

QuALITY: Question Answering with Long Input Texts, Yes!

Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Natural Questions: A Benchmark for Question Answering Research

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Can't Remember Details in Long Documents? You Need Some R&R

RAGProbe: An Automated Approach for Evaluating RAG Applications

In Defense of RAG in the Era of Long-Context Language Models

W-RAG: Weakly Supervised Dense Retrieval in RAG for Open-domain Question Answering

GenSco: Can Question Decomposition based Passage Alignment improve Question Answering?

CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models