MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Yixuan Tang, Yi Yang
2024-01-27
Abstract: Retrieval-augmented generation (RAG) augments large language models (LLMs) by retrieving relevant knowledge, showing promising potential in mitigating LLM hallucinations and enhancing response quality, thereby facilitating broader adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking dataset focuses on multi-hop queries. In this paper, we develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. We detail the procedure of building the dataset, utilizing an English news article dataset as the underlying RAG knowledge base. We demonstrate the benchmarking utility of MultiHop-RAG in two experiments. The first experiment compares different embedding models for retrieving evidence for multi-hop queries. In the second experiment, we examine the capabilities of various state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, in reasoning and answering multi-hop queries given the evidence. Both experiments reveal that existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries. We hope MultiHop-RAG will be a valuable resource for the community in developing effective RAG systems, thereby facilitating greater adoption of LLMs in practice. The MultiHop-RAG dataset and the implemented RAG system are publicly available at
Computation and Language
What problem does this paper attempt to address?
This paper addresses the inadequate performance of existing Retrieval-Augmented Generation (RAG) systems on multi-hop queries. Multi-hop queries require retrieving and reasoning over supporting evidence from multiple documents, and the existing benchmark datasets for RAG systems are not designed for this query type. The authors therefore developed a new dataset, MultiHop-RAG, to evaluate and improve the ability of RAG systems to handle multi-hop queries. The main contributions of the paper are:

1. **Constructed the MultiHop-RAG dataset**: The dataset contains a knowledge base, a large number of multi-hop queries, the ground-truth answers to these queries, and the associated supporting evidence.
2. **Classified multi-hop queries**: Based on RAG queries seen in practical applications, multi-hop queries are divided into four categories: Inference Query, Comparison Query, Temporal Query, and Null Query.
3. **Evaluated the retrieval performance of different embedding models**: Experiments compare how well different embedding models retrieve the evidence relevant to multi-hop queries.
4. **Evaluated the generation performance of various LLMs**: Experiments test the ability of several state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, to generate answers when given the retrieved evidence.

Through these efforts, the paper hopes that MultiHop-RAG can become an important resource for the community to develop and evaluate RAG systems, thereby promoting the wide application of generative AI in practice.
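
To make the dataset structure and the retrieval experiment more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the record layout (`MultiHopQuery` and its field names), the evidence identifiers, and the `evidence_recall_at_k` metric are all hypothetical and are not taken from the paper or the released dataset. Only the four query categories come from the summary above.

```python
from dataclasses import dataclass, field
from enum import Enum


class QueryType(Enum):
    # The four query categories named in the summary.
    INFERENCE = "inference"
    COMPARISON = "comparison"
    TEMPORAL = "temporal"
    NULL = "null"


@dataclass
class MultiHopQuery:
    # Hypothetical record layout; field names are assumptions for illustration.
    query: str
    query_type: QueryType
    answer: str
    evidence_ids: list[str] = field(default_factory=list)


def evidence_recall_at_k(retrieved_ids: list[str], gold_ids: list[str], k: int) -> float:
    """Fraction of gold evidence chunks found among the top-k retrieved chunks.

    This is an assumed, generic retrieval metric; the source text does not
    state which metrics the paper reports.
    """
    if not gold_ids:
        # Convention chosen for this sketch only: a query with no supporting
        # evidence (e.g. a null query) has nothing to miss.
        return 1.0
    top_k = set(retrieved_ids[:k])
    found = sum(1 for gold in gold_ids if gold in top_k)
    return found / len(gold_ids)


# Usage example with made-up identifiers.
example = MultiHopQuery(
    query="Which company is discussed in both article A and article B?",
    query_type=QueryType.INFERENCE,
    answer="(placeholder answer)",
    evidence_ids=["article_a#3", "article_b#1"],
)
retrieved = ["article_a#3", "article_c#7", "article_b#1", "article_d#2"]

recall_at_2 = evidence_recall_at_k(retrieved, example.evidence_ids, k=2)
recall_at_4 = evidence_recall_at_k(retrieved, example.evidence_ids, k=4)
```

With these made-up inputs, only one of the two evidence chunks appears in the top two results, while both appear in the top four. This shows why multi-hop queries are harder for a retriever: a single strong match is not enough, because the answer depends on every piece of supporting evidence being retrieved.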