Abstract: As an important component of intelligent legal systems, legal case retrieval plays a critical role in ensuring judicial justice and fairness. However, the development of legal case retrieval technologies in the Chinese legal system is restricted by three problems in existing datasets: limited data size, narrow definitions of legal relevance, and naive candidate pooling strategies used in data sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale Legal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents. To the best of our knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval datasets, providing extensive coverage of criminal charges. Additionally, we enrich the existing relevance criteria by considering three key aspects: characterization, penalty, and procedure. These comprehensive criteria enrich the dataset and may provide a more holistic perspective. Furthermore, we propose a two-level candidate set pooling strategy that effectively identifies potential candidates for each query case. All cases in the dataset have been annotated by multiple legal experts specializing in criminal law, whose expertise ensures the accuracy and reliability of the annotations. We evaluate several state-of-the-art retrieval models on LeCaRDv2, demonstrating that there is still significant room for improvement in legal case retrieval. The details of LeCaRDv2 can be found at the anonymous website <a class="link-external link-https" href="https://github.com/anonymous1113243/LeCaRDv2" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

The paper addresses three main problems that hold back legal case retrieval in the Chinese legal system:

1. **Limited data volume**: Existing legal case retrieval datasets such as LeCaRDv1 contain only a small number of query cases and annotated case documents, which may not be sufficient to train large language models or to provide reliable evaluation results. Specifically, LeCaRDv1 has only 107 query cases and 10,700 candidate cases, covering 20 types of crimes. This small scale limits its scope of application.
2. **Narrow definition of legal relevance**: The relevance criteria of LeCaRDv1 focus only on the factual description of a case and ignore similarity in penalty and procedure. Relevance criteria are a fundamental issue when building a high-quality dataset, especially in the legal field, where relevance differs from general text similarity and is not limited to topical relevance. Although LeCaRDv1 proposed new criteria to guide experts in judging relevance, it considers only the factual part, which may lead to a partial understanding of relevance and to biased annotations.
3. **Simple candidate pooling strategy**: LeCaRDv1 uses three retrieval models (TF-IDF, BM25, and LMIR) to construct a pool of 100 cases for each query. These methods rely mainly on lexical matching and have similar characteristics, so they may not accurately identify potentially relevant cases for annotation.

To solve these problems, the authors propose LeCaRDv2, a large-scale Chinese legal case retrieval dataset. LeCaRDv2 contains 800 query cases and 55,192 candidate cases, extracted from more than 4.3 million criminal case documents.

Compared with LeCaRDv1, LeCaRDv2 has the following characteristics:

- **Larger data scale**: The query cases in LeCaRDv2 cover 50 types of crimes, allowing a more comprehensive evaluation of retrieval models in the legal field.
- **More comprehensive relevance criteria**: Following official documents issued by the Supreme People's Court of China, LeCaRDv2 proposes new relevance criteria that cover three aspects: characterization, penalty, and procedure. This gives a more complete view of relevance.
- **Two-level candidate pooling strategy**: To identify potential cases with diverse characteristics, LeCaRDv2 proposes a two-level pooling strategy consisting of a retrieval pool step and a ranking pool step. In the retrieval pool step, sparse lexical matching, dense semantic retrieval, and legal article similarity are combined to increase the diversity of the candidate set. In the ranking pool step, the runs submitted by CAIL2021 participants are used to further prioritize the cases in the retrieval pool and identify those most worth annotating.

Through these improvements, LeCaRDv2 aims to serve as a reliable benchmark and to promote research and development in legal case retrieval.
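The two pooling steps can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy case IDs, the pool depth, and above all the reciprocal-rank voting used to aggregate the submitted runs are assumptions made for illustration, since the summary does not say how the runs are combined.

```python
from collections import defaultdict


def retrieval_pool(rankings, depth):
    """Step 1: union of the top-`depth` candidates from each first-stage ranker.

    `rankings` maps a ranker name (e.g. sparse, dense, legal-article
    similarity) to a list of candidate IDs ordered from best to worst.
    """
    pool, seen = [], set()
    for ranked_ids in rankings.values():
        for cid in ranked_ids[:depth]:
            if cid not in seen:
                seen.add(cid)
                pool.append(cid)
    return pool


def ranking_pool(pool, runs, size):
    """Step 2: prioritize pooled candidates using externally submitted runs.

    Assumption: each run votes for a candidate with weight 1 / rank.
    Candidates outside the retrieval pool are ignored.
    """
    allowed = set(pool)
    score = defaultdict(float)
    for run in runs:
        for rank, cid in enumerate(run, start=1):
            if cid in allowed:
                score[cid] += 1.0 / rank
    position = {cid: i for i, cid in enumerate(pool)}
    # Highest score first; ties fall back to the retrieval-pool order.
    ranked = sorted(pool, key=lambda cid: (-score[cid], position[cid]))
    return ranked[:size]


# Toy data: three hypothetical first-stage rankers and two submitted runs.
rankings = {
    "sparse": ["c1", "c2", "c3"],
    "dense": ["c4", "c1", "c5"],
    "law_articles": ["c6", "c2", "c7"],
}
runs = [["c4", "c1", "c9"], ["c4", "c6", "c1"]]

pool = retrieval_pool(rankings, depth=2)
to_annotate = ranking_pool(pool, runs, size=3)
print(pool)
print(to_annotate)
```

Because the first step takes a union across rankers with different characteristics, a case that only the dense or legal-article ranker surfaces can still reach the pool. The second step only reorders and truncates that pool.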

LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset
