LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, Zehua Li
2023-08-21
Abstract: The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.
Computation and Language, Artificial Intelligence, Computers and Society
What problem does this paper attempt to address?
This paper sets out to evaluate how well large language models (LLMs) perform legal reasoning. The authors identify the following limitations in existing legal reasoning benchmarks and in current practice:

1. **Limitations of Existing Benchmarks**:
   - **Limited Task Types**: Most existing benchmarks focus on tasks that models learn through fine-tuning or training on task-specific data, so they cannot measure how well LLMs perform many different tasks when prompted with only a few examples (few-shot prompting).
   - **Over-reliance on Professional Certification Exams**: Some evaluations focus on bar exams (such as the Uniform Bar Exam), but exam results do not necessarily reflect how LLMs perform in practical applications.
   - **Lack of Distinction among Different Types of Legal Reasoning**: Existing benchmarks generally classify every task involving legal data or law as "legal reasoning", without considering the different skills and knowledge that different legal tasks require.
2. **Safety and Ethical Issues**:
   - LLMs may generate misleading, inaccurate, or harmful content in legal applications, which is especially harmful for traditionally marginalized and under-resourced populations. It is therefore crucial to develop LLM benchmarking infrastructure and processes for the legal context.

To address these problems, the authors propose **LegalBench**, a collaboratively constructed legal reasoning benchmark that evaluates LLM performance on six different types of legal reasoning. The tasks were designed and contributed by legal professionals, which helps ensure their practicality and relevance. LegalBench thus provides a systematic evaluation of the legal reasoning capabilities of LLMs and also promotes interdisciplinary dialogue, enabling legal professionals to discuss LLM performance using familiar terms and conceptual frameworks.
### Specific Objectives

- **Construct a Comprehensive Legal Reasoning Benchmark**: LegalBench contains 162 tasks covering six different types of legal reasoning (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, and rhetorical-understanding).
- **Promote Interdisciplinary Collaboration**: Through the active participation of legal professionals, ensure that tasks measure capabilities that are practically useful or that lawyers find interesting.
- **Support Further Research**: Provide detailed task descriptions and supporting materials to help researchers better understand and evaluate the performance of LLMs in the legal field.

### Significance

LegalBench is significant for several groups:

- **Legal Practitioners** can evaluate whether LLMs are suitable for real workflows, thereby improving the quality of client service.
- **Legal Scholars** can explore new empirical research directions by observing the annotation capabilities of LLMs.
- **Computer Scientists** can study the performance of LLMs in the legal field and discover new insights and technical challenges.

In summary, the main purpose of this paper is to help stakeholders gain a deeper understanding of the potential and limitations of LLMs in legal reasoning by constructing LegalBench, thereby supporting the safe and ethical application of these technologies.