C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models

Jiahuan Cao, Yongxin Shi, Dezhi Peng, Yang Liu, Lianwen Jin

2024-05-30

Abstract: Classical Chinese Understanding (CCU) holds significant value in preserving and exploring the outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs for five primary CCU tasks: classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C$^{3}$bench originates from ten different domains, covering most of the categories in classical Chinese. Leveraging the proposed C$^{3}$bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also yield some findings. Specifically, existing LLMs struggle with CCU tasks and are still inferior to supervised models. Additionally, the results indicate that CCU is a task that requires special attention. We believe this study could provide a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at \url{https://github.com/SCUT-DLVCLab/C3bench}.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the lack of a comprehensive benchmark for evaluating the classical Chinese understanding (CCU) capabilities of large language models (LLMs). Although researchers have recently begun to explore LLMs for CCU tasks, progress has been limited by the absence of a standard CCU benchmark. To fill this gap, the paper proposes C$^{3}$bench, a comprehensive classical Chinese understanding benchmark comprising 50,000 text pairs that cover five major CCU tasks: classification, retrieval, named entity recognition, punctuation, and translation. The data is drawn from ten different domains, covering most categories of classical Chinese. Using this benchmark, the paper evaluates the quantitative performance of 15 representative LLMs on all five CCU tasks and establishes a public leaderboard. The results show that existing LLMs struggle with CCU tasks and remain inferior to supervised models, and they indicate that CCU is a task requiring special attention. The study thus provides a standard benchmark, comprehensive baselines, and insights for future LLM-based CCU research.

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

CMMLU: Measuring massive multitask language understanding in Chinese

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

AlignBench: Benchmarking Chinese Alignment of Large Language Models

CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

TongGu: Mastering Classical Chinese Understanding with Knowledge-Grounded Large Language Models

Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset

SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

CMB: A Comprehensive Medical Benchmark in Chinese

CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models

AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

An Improved Traditional Chinese Evaluation Suite for Foundation Model