C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models

Jiahuan Cao,Yongxin Shi,Dezhi Peng,Yang Liu,Lianwen Jin
2024-05-30
Abstract: Classical Chinese Understanding (CCU) holds significant value for preserving and exploring outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs for five primary CCU tasks: classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C$^{3}$bench originates from ten different domains, covering most of the categories in classical Chinese. Leveraging the proposed C$^{3}$bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also yield several findings. Specifically, existing LLMs struggle with CCU tasks and remain inferior to supervised models. Additionally, the results indicate that CCU is a task that requires special attention. We believe this study could provide a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at https://github.com/SCUT-DLVCLab/C3bench.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the lack of a comprehensive benchmark for evaluating the classical Chinese understanding (CCU) capabilities of large language models (LLMs). Although researchers have recently begun to explore the potential of LLMs for CCU tasks, progress has been limited by the absence of a standard CCU benchmark. To fill this gap, the paper proposes C$^{3}$bench, a comprehensive classical Chinese understanding benchmark comprising 50,000 text pairs that cover five major CCU tasks: classification, retrieval, named entity recognition, punctuation restoration, and translation. The data for C$^{3}$bench is drawn from ten different domains, covering most categories of classical Chinese. Using this benchmark, the paper evaluates the quantitative performance of 15 representative LLMs on all five CCU tasks and establishes a public leaderboard for these models. The evaluation results show that existing LLMs struggle with CCU tasks and still perform worse than supervised models, and they indicate that CCU is a task requiring special attention. The study provides a standard benchmark, comprehensive baselines, and insights for future LLM-based CCU research.