Abstract: Large language models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal tasks. To address this gap, we propose a comprehensive evaluation benchmark, LawBench. LawBench has been meticulously crafted to precisely assess LLMs' legal capabilities at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize the necessary legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs can comprehend entities, events and relationships within legal text; (3) Legal knowledge applying: whether LLMs can properly utilize their legal knowledge and carry out the necessary reasoning steps to solve realistic legal tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label classification (SLC), multi-label classification (MLC), regression, extraction and generation. We perform extensive evaluations of 51 LLMs on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs. The results show that GPT-4 remains the best-performing LLM in the legal domain, surpassing the others by a significant margin. While fine-tuning LLMs on legal-specific text brings certain improvements, we are still a long way from obtaining usable and reliable LLMs for legal tasks. All data, model predictions and evaluation code are released at https://github.com/open-compass/LawBench/. We hope this benchmark provides an in-depth understanding of LLMs' domain-specific capabilities and speeds up the development of LLMs in the legal domain.
