Abstract: Large language models (LLMs) have demonstrated strong capabilities across many tasks. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal tasks. To address this gap, we propose a comprehensive evaluation benchmark, LawBench. LawBench is designed to assess LLMs' legal capabilities precisely at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize the necessary legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs can comprehend entities, events and relationships within legal text; (3) Legal knowledge application: whether LLMs can properly utilize their legal knowledge and take the necessary reasoning steps to solve realistic legal tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label classification (SLC), multi-label classification (MLC), regression, extraction and generation. We perform extensive evaluations of 51 LLMs on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs. The results show that GPT-4 remains the best-performing LLM in the legal domain, surpassing the others by a significant margin. While fine-tuning LLMs on legal-specific text brings certain improvements, we are still a long way from obtaining usable and reliable LLMs for legal tasks. All data, model predictions and evaluation code are released at <a class="link-external link-https" href="https://github.com/open-compass/LawBench/" rel="external noopener nofollow">this https URL</a>. We hope this benchmark provides an in-depth understanding of LLMs' domain-specific capabilities and speeds up the development of LLMs in the legal domain.

What problem does this paper attempt to address?

The paper addresses the problem of evaluating how much legal knowledge large language models (LLMs) possess and how reliably they perform legal tasks. Specifically, the authors propose a comprehensive evaluation benchmark, LawBench, which assesses the legal capabilities of LLMs at three cognitive levels: legal knowledge memorization, understanding, and application. LawBench includes 20 diverse tasks covering five task types: single-label classification, multi-label classification, regression, extraction, and generation.

### Background and Motivation

1. **Specialty of the Legal Domain**: Legal tasks involve highly specialized texts that require understanding complex legal concepts and provisions. Currently, these tasks are mainly performed by legal experts who have undergone years of specialized training.
2. **Limitations of Existing Evaluations**: Existing evaluation benchmarks mainly focus on general capabilities and specific exams (such as the bar exam), but these evaluations do not necessarily reflect the performance of LLMs on actual legal tasks.
3. **Improving Legal Efficiency and Accessibility**: Equipping LLMs with legal expertise can not only enhance the efficiency of legal professionals but also meet the substantial demand for legal assistance from non-professionals, thereby improving public access to justice.

### Solution

1. **Design of LawBench**:
   - **Task Types**: Covers five task types: single-label classification, multi-label classification, regression, extraction, and generation.
   - **Cognitive Levels**:
     - **Legal Knowledge Memorization**: Evaluates whether LLMs can remember the necessary legal concepts, provisions, and facts.
     - **Legal Knowledge Understanding**: Evaluates whether LLMs can understand entities, events, and relationships in legal texts.
     - **Legal Knowledge Application**: Evaluates whether LLMs can properly use their legal knowledge and carry out the reasoning steps needed to solve realistic legal tasks.
2. **Data Sources**: Task data comes from multiple public datasets, including legal judgments, consultation questions, news reports, etc.
3. **Evaluation Method**: The authors evaluate 51 popular LLMs, including 20 multilingual models, 22 Chinese-oriented models, and 9 legal-specific models. They design rules, regular expressions, and metrics suited to each task in order to extract answers from model outputs reliably.

### Key Findings

1. **GPT-4 Performs Best**: In the legal domain, GPT-4 significantly outperforms other models.
2. **Impact of Fine-Tuning**: Although fine-tuning on legal texts can bring some improvements, there is still a significant gap between legal-specific models and general LLMs.
3. **Future Directions**: Important recommendations are proposed to better guide the future development of legal LLMs for the Chinese legal community.

Through LawBench, the authors hope to provide a structured and comprehensive evaluation framework for LLM research in the legal domain, promoting further development in this field.
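The evaluation method described above extracts an answer from free-form model output with rules and regular expressions, then scores it with a task-specific metric. The following is a minimal sketch of that idea for a single-label and a multi-label task. It is not LawBench's released evaluation code: the function names, the `A`-`D` option format, and the regex patterns are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence, Set

# Assumed answer format: one option letter A-D, optionally after a cue
# such as "Answer:" or "the answer is". These patterns are illustrative only.
CUE_RE = re.compile(r"(?i:answer\s*(?:is)?)\s*[:：]?\s*\(?([A-D])\b")
BARE_RE = re.compile(r"\b([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the first option letter found in a model response, or None."""
    match = CUE_RE.search(response)
    if match:
        return match.group(1)
    # Fall back to the first standalone capital letter A-D.
    match = BARE_RE.search(response)
    return match.group(1) if match else None


def accuracy(preds: Sequence[Optional[str]], golds: Sequence[str]) -> float:
    """Fraction of exact matches; a failed extraction (None) counts as wrong."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def multilabel_f1(pred: Set[str], gold: Set[str]) -> float:
    """F1 between a predicted label set and a gold label set for one example."""
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


responses = ["The answer is B because ...", "No idea", "Answer: (C)"]
preds = [extract_choice(r) for r in responses]
score = accuracy(preds, ["B", "A", "C"])
```

A fallback pattern like `BARE_RE` trades precision for recall: it recovers answers stated without a cue, but it can also pick up a stray capital letter, which is one reason per-task extraction rules are needed in practice.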

LawBench: Benchmarking Legal Knowledge of Large Language Models

LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models

LAiW: A Chinese Legal Large Language Models Benchmark

A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

InternLM-Law: An Open Source Chinese Legal Large Language Model

One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Fine-tuning and Application of Large Language Model in Law Domain

LawLLM: Law Large Language Model for the US Legal System

LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model

Large Language Models are legal but they are not: Making the case for a powerful LegalLLM

Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models

SafetyBench: Evaluating the Safety of Large Language Models

Legal Evalutions and Challenges of Large Language Models

BLT: Can Large Language Models Handle Basic Legal Text?

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Exploring New Frontiers of Deep Learning in Legal Practice: A Case Study of Large Language Models

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

Lawma: The Power of Specialization for Legal Tasks

Optimizing Numerical Estimation and Operational Efficiency in the Legal Domain through Large Language Models

LeKUBE: A Legal Knowledge Update BEnchmark