LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain

Nicholas Pipitone, Ghita Houir Alami
2024-08-20
Abstract: Retrieval-Augmented Generation (RAG) systems are showing promising potential, and are becoming increasingly relevant in AI-powered legal applications. Existing benchmarks, such as LegalBench, assess the generative capabilities of Large Language Models (LLMs) in the legal domain, but there is a critical gap in evaluating the retrieval component of RAG systems. To address this, we introduce LegalBench-RAG, the first benchmark specifically designed to evaluate the retrieval step of RAG pipelines within the legal space. LegalBench-RAG emphasizes precise retrieval by focusing on extracting minimal, highly relevant text segments from legal documents. These highly relevant snippets are preferred over retrieving document IDs, or large sequences of imprecise chunks, both of which can exceed context window limitations. Long context windows cost more to process, induce higher latency, and lead LLMs to forget or hallucinate information. Additionally, precise results allow LLMs to generate citations for the end user. The LegalBench-RAG benchmark is constructed by retracing the context used in LegalBench queries back to their original locations within the legal corpus, resulting in a dataset of 6,858 query-answer pairs over a corpus of over 79M characters, entirely human-annotated by legal experts. We also introduce LegalBench-RAG-mini, a lightweight version for rapid iteration and experimentation. By providing a dedicated benchmark for legal retrieval, LegalBench-RAG serves as a critical tool for companies and researchers focused on enhancing the accuracy and performance of RAG systems in the legal domain. The LegalBench-RAG dataset is publicly available at https://github.com/zeroentropy-cc/legalbenchrag.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a gap in evaluating the retrieval component of Retrieval-Augmented Generation (RAG) systems in the legal domain. Specifically:

- **Main Contribution**: The paper introduces LegalBench-RAG, the first benchmark dataset specifically designed to evaluate the retrieval step of RAG systems in the legal domain. The benchmark emphasizes precisely retrieving minimal, highly relevant text fragments from legal documents, rather than returning document IDs or large, imprecise chunks of text.
- **Dataset Characteristics**: LegalBench-RAG consists of 6,858 query-answer pairs manually annotated by legal experts, covering a legal corpus of over 79 million characters. Each query corresponds to one or more precise text fragments extracted from the original documents.
- **Practical Application**: By providing such a specialized benchmark, LegalBench-RAG gives enterprises and researchers a tool for improving the accuracy and performance of RAG systems in the legal domain.

In summary, the goal of this paper is to fill the gap in benchmarks for evaluating the retrieval capabilities of RAG systems in the legal domain, thereby advancing the technology in this field.
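
To make the idea of snippet-level retrieval concrete, the following is a minimal sketch of how retrieved text spans might be scored against annotated spans at the character level. The text above does not specify the benchmark's data schema or its metrics, so the `Span` class, its field names, the file name `nda.txt`, and the choice of character-level precision and recall are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A character range [start, end) inside one corpus file (hypothetical schema)."""
    file_path: str
    start: int
    end: int


def _covered_chars(spans):
    """Return the set of (file_path, char_index) positions covered by the spans."""
    return {(s.file_path, i) for s in spans for i in range(s.start, s.end)}


def span_precision_recall(retrieved, ground_truth):
    """Character-level precision and recall of retrieved spans vs. annotated spans.

    Assumed metric: precision is the share of retrieved characters that are
    annotated as relevant; recall is the share of annotated characters retrieved.
    """
    retrieved_chars = _covered_chars(retrieved)
    truth_chars = _covered_chars(ground_truth)
    overlap = len(retrieved_chars & truth_chars)
    precision = overlap / len(retrieved_chars) if retrieved_chars else 0.0
    recall = overlap / len(truth_chars) if truth_chars else 0.0
    return precision, recall


# Illustrative data only: one annotated 100-character snippet and one
# 200-character retrieved chunk that overlaps half of it.
truth = [Span("nda.txt", 100, 200)]
retrieved = [Span("nda.txt", 150, 350)]
print(span_precision_recall(retrieved, truth))  # (0.25, 0.5)
```

Under this assumed metric, a retriever that returns a large chunk around the relevant clause is penalized on precision even when recall is high, which mirrors the preference for minimal, highly relevant segments over large, imprecise chunks. The set-based implementation favors clarity over efficiency; an interval-based overlap computation would be a more sensible choice for a corpus of this size.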