CERD: A Comprehensive Chinese Rhetoric Dataset for Rhetorical Understanding and Generation in Essays

Nuowei Liu, Xinhao Chen, Hongyi Wu, Changzhi Sun, Man Lan, Yuanbin Wu, Xiaopeng Bai, Shaoguang Mao, Yan Xia
2024-09-29
Abstract: Existing rhetorical understanding and generation datasets or corpora primarily focus on single coarse-grained categories or fine-grained categories, neglecting the common interrelations between different rhetorical devices by treating them as independent sub-tasks. In this paper, we propose the Chinese Essay Rhetoric Dataset (CERD), consisting of 4 commonly used coarse-grained categories including metaphor, personification, hyperbole and parallelism and 23 fine-grained categories across both form and content levels. CERD is a manually annotated and comprehensive Chinese rhetoric dataset with five interrelated sub-tasks. Unlike previous work, our dataset aids in understanding various rhetorical devices, recognizing corresponding rhetorical components, and generating rhetorical sentences under given conditions, thereby improving the author's writing proficiency and language usage skills. Extensive experiments are conducted to demonstrate the interrelations between multiple tasks in CERD, as well as to establish a benchmark for future research on rhetoric. The experimental results indicate that Large Language Models achieve the best performance across most tasks, and jointly fine-tuning with multiple tasks further enhances performance.
Computation and Language
What problem does this paper attempt to address?
The paper addresses a gap in existing rhetoric understanding and generation datasets: they focus on either a single coarse-grained category or on fine-grained categories, and they treat different rhetorical devices as independent sub-tasks. This ignores the interrelations between devices and gives a limited, one-sided view of rhetorical phenomena. To address this, the authors propose the Chinese Essay Rhetoric Dataset (CERD), which contains four commonly used coarse-grained categories (metaphor, personification, hyperbole, parallelism) and 23 fine-grained categories covering both the form and content levels. The main characteristics of CERD are:

1. **Multi-task framework**: CERD contains five interrelated sub-tasks covering different aspects of rhetoric understanding and generation: Rhetoric Classification (RC), Form Classification (FC), Content Classification (CC), Component Extraction (CE) and Rhetoric Generation (RG).
2. **Comprehensive annotation**: Each essay is manually annotated, and annotations are given at the sentence level for every task except Rhetoric Generation (RG).
3. **Rich rhetorical categories**: CERD includes the common coarse-grained rhetorical categories and further subdivides them into fine-grained categories, giving a more detailed view of rhetoric.
4. **Interrelated tasks**: CERD emphasizes the connections between different rhetorical devices and exposes them through the multi-task framework, which supports a more complete understanding of rhetorical phenomena.

Through these design choices, CERD aims to provide a more comprehensive benchmark for rhetoric understanding and generation, to serve as a reference point for future rhetoric research, and to help authors improve their writing proficiency and language usage skills. The experimental results show that large language models (LLMs) achieve the best performance on most tasks, and that joint fine-tuning on multiple tasks further improves performance.
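To make the relationship between the five sub-tasks concrete, here is a minimal sketch of how one annotated sentence could be turned into one example per task for joint fine-tuning. This is not the dataset's actual schema: the class `SentenceAnnotation`, the function `to_task_examples`, the field names, the placeholder fine-grained labels, the English example sentence, and the assumption that a sentence may carry more than one label are all hypothetical. Only the four coarse-grained categories and the five task abbreviations come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Rhetoric(Enum):
    # The four coarse-grained categories named in the abstract.
    METAPHOR = "metaphor"
    PERSONIFICATION = "personification"
    HYPERBOLE = "hyperbole"
    PARALLELISM = "parallelism"


@dataclass
class SentenceAnnotation:
    # Hypothetical record layout; the real CERD format may differ.
    sentence: str
    rhetoric: List[Rhetoric]   # target for Rhetoric Classification (RC)
    form: List[str]            # fine-grained form labels (FC), names not given in the text
    content: List[str]         # fine-grained content labels (CC), names not given in the text
    components: List[str]      # rhetorical components found in the sentence (CE)


def to_task_examples(ann: SentenceAnnotation) -> Dict[str, Tuple[object, object]]:
    """Build one (input, target) pair per sub-task from a single annotation."""
    labels = [r.value for r in ann.rhetoric]
    return {
        "RC": (ann.sentence, labels),
        "FC": (ann.sentence, ann.form),
        "CC": (ann.sentence, ann.content),
        "CE": (ann.sentence, ann.components),
        # Assumption: generation is conditioned on the rhetoric labels only.
        "RG": (labels, ann.sentence),
    }


# Illustrative sentence, not taken from CERD.
example = SentenceAnnotation(
    sentence="The moon is a silver plate.",
    rhetoric=[Rhetoric.METAPHOR],
    form=["<form label>"],
    content=["<content label>"],
    components=["the moon", "a silver plate"],
)
examples = to_task_examples(example)
```

Because all five pairs derive from the same annotation, a jointly fine-tuned model sees the same sentence from several angles, which is one plausible reading of why multi-task training helps.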