LeDQA: A Chinese Legal Case Document-based Question Answering Dataset

Bulou Liu, Zhenhao Zhu, Qingyao Ai, Yiqun Liu, Yueyue Wu
DOI: https://doi.org/10.1145/3627673.3679154
2024-01-01
Abstract: Legal question answering based on case documents is a pivotal legal AI application that helps extract key elements from legal case documents to support downstream tasks. Intuitively, the form of this task is similar to legal machine reading comprehension. However, in existing legal machine reading comprehension datasets, the background information is much shorter than legal case documents, and the questions are not designed from the perspective of legal knowledge. In this paper, we present LeDQA, which is, to the best of our knowledge, the first Chinese legal case document-based question answering dataset. Specifically, we work with legal professionals to build a comprehensive question schema (including 48 element-based questions) for Chinese civil law. Because human annotation is too expensive, we use one of the SOTA LLMs (i.e., GPT-4) to annotate the sentences relevant to these questions in each case document. The constructed dataset originates from Chinese civil cases and contains 100 case documents, 4,800 case-question pairs, and 132,048 sentence-level relevance annotations. We implement several text matching algorithms for relevant sentence selection and various Large Language Models (LLMs) for legal question answering on LeDQA. The experimental results indicate that incorporating relevant sentences can improve the performance of question answering models, but further efforts are still required to address the remaining challenges, such as the retrieval of irrelevant sentences and incorrect reasoning across retrieved sentences.