LawBench: Benchmarking Legal Knowledge of Large Language Models

Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge
DOI: https://doi.org/10.48550/arXiv.2309.16289
2023-09-28
Abstract: Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal-related tasks. To address this gap, we propose a comprehensive evaluation benchmark, LawBench. LawBench has been meticulously crafted to precisely assess LLMs' legal capabilities at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize needed legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs can comprehend entities, events and relationships within legal text; (3) Legal knowledge applying: whether LLMs can properly utilize their legal knowledge and make necessary reasoning steps to solve realistic legal tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label classification (SLC), multi-label classification (MLC), regression, extraction and generation. We perform extensive evaluations of 51 LLMs on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs. The results show that GPT-4 remains the best-performing LLM in the legal domain, surpassing the others by a significant margin. While fine-tuning LLMs on legal-specific text brings certain improvements, we are still a long way from obtaining usable and reliable LLMs in legal tasks. All data, model predictions and evaluation code are released at https://github.com/open-compass/LawBench/. We hope this benchmark provides an in-depth understanding of LLMs' domain-specific capabilities and speeds up the development of LLMs in the legal domain.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of evaluating how much legal knowledge large language models (LLMs) possess and how reliably they perform legal tasks. Specifically, the authors propose a comprehensive evaluation benchmark, LawBench, that assesses the legal capabilities of LLMs at three cognitive levels: legal knowledge memorization, understanding, and application. LawBench includes 20 diverse tasks covering five task types: single-label classification, multi-label classification, regression, extraction, and generation.

### Background and Motivation

1. **Specialty of the Legal Domain**: Legal tasks involve highly specialized texts that require understanding complex legal concepts and provisions. Currently, these tasks are mainly performed by legal experts who have undergone years of specialized training.
2. **Limitations of Existing Evaluations**: Existing evaluation benchmarks mainly focus on general capabilities and specific exams (such as the bar exam), but these evaluations do not necessarily reflect the performance of LLMs on actual legal tasks.
3. **Improving Legal Efficiency and Accessibility**: Equipping LLMs with legal expertise can not only enhance the efficiency of legal professionals but also meet the substantial demand for legal assistance from non-professionals, thereby improving public access to justice.

### Solution

1. **Design of LawBench**:
   - **Task Types**: Covers five task types: single-label classification, multi-label classification, regression, extraction, and generation.
   - **Cognitive Levels**:
     - **Legal Knowledge Memorization**: Evaluates whether LLMs can remember necessary legal concepts, provisions, and facts.
     - **Legal Knowledge Understanding**: Evaluates whether LLMs can understand entities, events, and relationships in legal texts.
     - **Legal Knowledge Application**: Evaluates whether LLMs can properly use their legal knowledge and perform the reasoning steps needed to solve actual legal tasks.
2. **Data Sources**: Task data comes from multiple public datasets, including legal judgments, consultation questions, news reports, etc.
3. **Evaluation Method**: The authors evaluated 51 popular LLMs, including 20 multilingual models, 22 Chinese-oriented models, and 9 legal-specific models. They designed rules, regular expressions, and metrics suited to each task in order to extract answers from model output effectively.

### Key Findings

1. **GPT-4 Performs Best**: In the legal domain, GPT-4 significantly outperforms other models.
2. **Impact of Fine-Tuning**: Although fine-tuning on legal texts can bring some improvements, there is still a significant gap between legal-specific models and general LLMs.
3. **Future Directions**: The authors propose recommendations to guide the future development of legal LLMs for the Chinese legal community.

Through LawBench, the authors hope to provide a structured and comprehensive evaluation framework for LLM research in the legal domain, promoting further development in this field.
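
The evaluation method described above relies on rules and regular expressions to pull an answer out of free-form model output before scoring it. The sketch below illustrates that general idea for a single-label classification task with options A to D. It is not LawBench's actual extraction code: the pattern, the function names `extract_choice` and `accuracy`, and the A to D label set are all assumptions made for illustration.

```python
import re
from typing import List, Optional

# Hypothetical rule: the answer is the first uppercase option letter (A-D)
# that is not part of a longer Latin-letter word.
OPTION_PATTERN = re.compile(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])")


def extract_choice(model_output: str) -> Optional[str]:
    """Return the first standalone option letter, or None if none is found."""
    match = OPTION_PATTERN.search(model_output)
    return match.group(1) if match else None


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Share of outputs whose extracted option equals the reference label."""
    if not references:
        return 0.0
    correct = sum(
        extract_choice(pred) == ref for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    outputs = ["The answer is B.", "Answer: C", "I am not sure"]
    labels = ["B", "C", "A"]
    print(accuracy(outputs, labels))
```

A rule this simple has obvious failure modes (for example, an English sentence beginning with the article "A" would be read as option A), which is presumably why task-specific rules are needed in practice. An output with no extractable answer is counted as incorrect here; that is a design choice of this sketch, not something stated in the source.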