MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Yixuan Tang, Yi Yang

2024-01-27

Abstract: Retrieval-augmented generation (RAG) augments large language models (LLMs) by retrieving relevant knowledge, showing promising potential in mitigating LLM hallucinations and enhancing response quality, thereby facilitating greater adoption of LLMs in practice. However, we find that existing RAG systems are inadequate in answering multi-hop queries, which require retrieving and reasoning over multiple pieces of supporting evidence. Furthermore, to our knowledge, no existing RAG benchmarking dataset focuses on multi-hop queries. In this paper, we develop a novel dataset, MultiHop-RAG, which consists of a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. We detail the procedure of building the dataset, utilizing an English news article dataset as the underlying RAG knowledge base. We demonstrate the benchmarking utility of MultiHop-RAG in two experiments. The first experiment compares different embedding models for retrieving evidence for multi-hop queries. In the second experiment, we examine the capabilities of various state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, in reasoning and answering multi-hop queries given the evidence. Both experiments reveal that existing RAG methods perform unsatisfactorily in retrieving and answering multi-hop queries. We hope MultiHop-RAG will be a valuable resource for the community in developing effective RAG systems, thereby facilitating greater adoption of LLMs in practice. The MultiHop-RAG and implemented RAG system are publicly available at

Computation and Language

What problem does this paper attempt to address?

The paper addresses the poor performance of existing Retrieval-Augmented Generation (RAG) systems on multi-hop queries. Multi-hop queries require retrieving and reasoning over supporting evidence from multiple documents, and existing RAG benchmark datasets are not designed for this query type. The authors therefore developed a new dataset, MultiHop-RAG, to evaluate and improve how RAG systems handle multi-hop queries. The main contributions of the paper are:

1. **Constructed the MultiHop-RAG dataset**: The dataset contains a knowledge base, a large number of multi-hop queries, their ground-truth answers, and the associated supporting evidence.
2. **Classified multi-hop queries**: Based on RAG queries seen in practical applications, multi-hop queries are divided into four categories: Inference Query, Comparison Query, Temporal Query, and Null Query.
3. **Evaluated the retrieval performance of different embedding models**: Experiments compare how well different embedding models retrieve evidence for multi-hop queries.
4. **Evaluated the generation performance of various LLMs**: The authors tested the ability of several state-of-the-art LLMs, including GPT-4, PaLM, and Llama2-70B, to generate answers when given the retrieved evidence.

The paper hopes MultiHop-RAG will become an important resource for the community in developing and evaluating RAG systems, thereby promoting the wide application of generative AI in practice.
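To make the dataset components and the retrieval evaluation described above more concrete, here is a minimal Python sketch. It is not the authors' code or the dataset's actual schema: the `MultiHopExample` record, its field names, and the `evidence_recall_at_k` metric are all hypothetical, chosen only to illustrate how a query, its category, its answer, and its supporting evidence could be stored and how a retriever's ranked output could be scored against that evidence. Treating an example with no evidence IDs as unscored is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class QueryType(Enum):
    # The four query categories named in the summary.
    INFERENCE = "inference"
    COMPARISON = "comparison"
    TEMPORAL = "temporal"
    NULL = "null"


@dataclass
class MultiHopExample:
    # Hypothetical record layout; field names are illustrative only.
    query: str
    query_type: QueryType
    answer: str
    evidence_ids: List[str] = field(default_factory=list)


def evidence_recall_at_k(
    example: MultiHopExample, ranked_chunk_ids: List[str], k: int
) -> Optional[float]:
    """Fraction of the example's evidence found in the top-k retrieved chunks.

    Returns None when the example lists no evidence (an assumption made
    here so that such examples can be skipped when averaging).
    """
    if not example.evidence_ids:
        return None
    top_k = set(ranked_chunk_ids[:k])
    found = sum(1 for eid in example.evidence_ids if eid in top_k)
    return found / len(example.evidence_ids)
```

A multi-hop query is only fully supported when every piece of evidence is retrieved, so a score below 1.0 at a given `k` signals that the generator would be answering from incomplete evidence. Comparing this score across embedding models, on the same queries and the same `k`, mirrors the kind of retrieval comparison the summary describes.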

EfficientRAG: Efficient Retriever for Multi-Hop Question Answering

Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata

DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation

Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study

Retrieval-Augmented Generation for Large Language Models: A Survey

CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

Benchmarking Large Language Models in Retrieval-Augmented Generation

Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

LLMs Know What They Need: Leveraging a Missing Information Guided Framework to Empower Retrieval-Augmented Generation

Hierarchical Retrieval-Augmented Generation Model with Rethink for Multi-hop Question Answering

Layered Query Retrieval: an Adaptive Framework for Retrieval-Augmented Generation in Complex Question Answering for Large Language Models

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

Towards Multi-Source Retrieval-Augmented Generation via Synergizing Reasoning and Preference-Driven Retrieval