LegalAgentBench: Evaluating LLM Agents in Legal Domain
Haitao Li, Junjie Chen, Jingli Yang, Qingyao Ai, Wei Jia, Youfeng Liu, Kai Lin, Yueyue Wu, Guozhi Yuan, Yiran Hu, Wuyue Wang, Yiqun Liu, Minlie Huang
2024-12-23
Abstract: As LLM agents grow more intelligent and autonomous, their potential applications in the legal domain are becoming more apparent. However, existing general-domain benchmarks cannot fully capture the complexity and nuances of real-world judicial cognition and decision-making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge. We designed a scalable task construction framework and carefully annotated 300 tasks. These tasks span various types, including multi-hop reasoning and writing, and range across different difficulty levels, effectively reflecting the complexity of real-world legal scenarios. Moreover, beyond evaluating final success, LegalAgentBench incorporates keyword analysis during intermediate processes to calculate progress rates, enabling more fine-grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at https://github.com/CSHaitao/LegalAgentBench.
Computation and Language, Information Retrieval
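
The abstract describes a progress rate computed by keyword analysis over an agent's intermediate steps, alongside a final success measure. The abstract does not give the exact formula, so the following is only a minimal sketch under an assumed definition: the progress rate is taken to be the fraction of annotated keywords that appear anywhere in the agent's intermediate outputs, and success is taken to mean that every answer keyword appears in the final answer. The function names, the substring matching rule, and the sample data are all hypothetical and are not taken from the LegalAgentBench code.

```python
from typing import Sequence


def progress_rate(trajectory: Sequence[str], keywords: Sequence[str]) -> float:
    """Assumed definition: share of annotated keywords found in intermediate outputs."""
    if not keywords:
        return 0.0
    # Join all intermediate outputs so a keyword counts if it shows up in any step.
    text = "\n".join(trajectory)
    hits = sum(1 for kw in keywords if kw in text)
    return hits / len(keywords)


def is_success(final_answer: str, answer_keywords: Sequence[str]) -> bool:
    """Assumed definition: the final answer must contain every answer keyword."""
    return all(kw in final_answer for kw in answer_keywords)


# Hypothetical multi-hop task: two of the three annotated keywords were reached.
steps = [
    "Queried company registry: parent company is Alpha Holdings",
    "Looked up case number 2020-001",
]
annotated = ["Alpha Holdings", "2020-001", "Beijing"]
print(progress_rate(steps, annotated))
print(is_success("The court is located in Shanghai", ["Beijing"]))
```

Under this assumed definition, an agent that fails the final answer can still receive partial credit for the intermediate hops it completed, which is the kind of fine-grained signal the abstract refers to.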