LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain

Nicholas Pipitone, Ghita Houir Alami

2024-08-20

Abstract: Retrieval-Augmented Generation (RAG) systems are showing promising potential, and are becoming increasingly relevant in AI-powered legal applications. Existing benchmarks, such as LegalBench, assess the generative capabilities of Large Language Models (LLMs) in the legal domain, but there is a critical gap in evaluating the retrieval component of RAG systems. To address this, we introduce LegalBench-RAG, the first benchmark specifically designed to evaluate the retrieval step of RAG pipelines within the legal space. LegalBench-RAG emphasizes precise retrieval by focusing on extracting minimal, highly relevant text segments from legal documents. These highly relevant snippets are preferred over retrieving document IDs, or large sequences of imprecise chunks, both of which can exceed context window limitations. Long context windows cost more to process, induce higher latency, and lead LLMs to forget or hallucinate information. Additionally, precise results allow LLMs to generate citations for the end user. The LegalBench-RAG benchmark is constructed by retracing the context used in LegalBench queries back to their original locations within the legal corpus, resulting in a dataset of 6,858 query-answer pairs over a corpus of over 79M characters, entirely human-annotated by legal experts. We also introduce LegalBench-RAG-mini, a lightweight version for rapid iteration and experimentation. By providing a dedicated benchmark for legal retrieval, LegalBench-RAG serves as a critical tool for companies and researchers focused on enhancing the accuracy and performance of RAG systems in the legal domain. The LegalBench-RAG dataset is publicly available at https://github.com/zeroentropy-cc/legalbenchrag.

Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address the gap in the evaluation of the retrieval component of Retrieval-Augmented Generation (RAG) systems in the legal domain. Specifically:

- **Main Contribution**: The paper introduces LegalBench-RAG, the first benchmark dataset specifically designed to evaluate the retrieval step of RAG systems in the legal domain. The benchmark emphasizes precisely retrieving minimal, highly relevant text fragments from legal documents, rather than retrieving document IDs or large, imprecise text blocks.
- **Dataset Characteristics**: LegalBench-RAG consists of 6,858 query-answer pairs manually annotated by legal experts, covering a legal corpus of over 79 million characters. Each query corresponds to one or more precise text fragments extracted from the original documents.
- **Practical Application**: By providing such a specialized benchmark, LegalBench-RAG becomes an important tool for enterprises and researchers to improve the accuracy and performance of RAG systems in the legal domain.

In summary, the goal of this paper is to fill the existing gap in benchmarks for evaluating the retrieval capabilities of RAG systems in the legal domain, thereby advancing the technology in this field.
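To make the idea of scoring "precise text fragments" concrete, here is a minimal sketch of one way retrieval quality could be measured at the character level. It assumes that both the ground-truth fragments and the retrieved snippets are represented as half-open `(start, end)` character offsets within a single document; that representation, and the function names `merge_spans`, `overlap_chars`, and `span_precision_recall`, are illustrative assumptions and are not taken from the paper or its repository.

```python
from typing import List, Tuple

# Assumption: a span is a half-open [start, end) character range in one document.
Span = Tuple[int, int]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge any that overlap or touch, dropping empty ones."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if start >= end:
            continue  # skip empty or inverted spans
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_chars(a: List[Span], b: List[Span]) -> int:
    """Count characters covered by both span sets (two-pointer sweep)."""
    a, b = merge_spans(a), merge_spans(b)
    total = 0
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def span_precision_recall(retrieved: List[Span], relevant: List[Span]) -> Tuple[float, float]:
    """Character-level precision and recall of retrieved spans vs. relevant spans."""
    retrieved_len = sum(end - start for start, end in merge_spans(retrieved))
    relevant_len = sum(end - start for start, end in merge_spans(relevant))
    hit = overlap_chars(retrieved, relevant)
    precision = hit / retrieved_len if retrieved_len else 0.0
    recall = hit / relevant_len if relevant_len else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: one annotated fragment, two retrieved chunks.
    relevant = [(100, 200)]
    retrieved = [(150, 250), (400, 450)]
    precision, recall = span_precision_recall(retrieved, relevant)
    print(f"precision={precision:.3f} recall={recall:.3f}")
```

Under this scoring, a retriever that returns large, imprecise chunks can still reach high recall but is penalized on precision, which matches the benchmark's stated preference for minimal, highly relevant snippets.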

RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems

BERGEN: A Benchmarking Library for Retrieval-Augmented Generation

CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation

CodeRAG-Bench: Can Retrieval Augment Code Generation?

CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models

CBR-RAG: Case-Based Reasoning for Retrieval Augmented Generation in LLMs for Legal Question Answering

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

LegalAgentBench: Evaluating LLM Agents in Legal Domain

UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis

LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning

Benchmarking Retrieval-Augmented Generation for Medicine

The Power of Noise: Redefining Retrieval for RAG Systems

Toward Optimal Search and Retrieval for RAG

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

CRAG -- Comprehensive RAG Benchmark

Benchmarking Large Language Models in Retrieval-Augmented Generation