CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Quan Tu, Shilong Fan, Zihang Tian, Rui Yan
2024-01-10
Abstract: Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at
Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is the lack of a comprehensive benchmark for evaluating role-playing conversational agents (RPCAs). Existing datasets are either generated by large language models (LLMs) and suffer from quality problems, or are extracted from existing texts and contain substantial noise, which makes evaluation results unreliable. To address this, the authors propose CharacterEval, a benchmark for evaluating Chinese role-playing conversational agents.

### Summary of Main Problems:

1. **Lack of a comprehensive evaluation benchmark**: Existing evaluation methods and datasets cannot comprehensively and accurately evaluate the performance of role-playing conversational agents.
2. **Data quality problems**: The quality of existing datasets varies greatly. Many are generated by LLMs, and others pick up a large amount of noise when extracted from texts.
3. **Insufficient evaluation dimensions**: A multi-dimensional evaluation framework is required to measure RPCA performance comprehensively, covering conversational ability, character consistency, role-playing attractiveness, and personality back-testing.

### Specific Contributions of CharacterEval:

- **Constructing a high-quality dataset**: CharacterEval contains 1,785 multi-turn role-playing conversations, covering 11,376 examples and 77 main characters from multiple Chinese novels and scripts.
- **Proposing a multi-dimensional evaluation framework**: CharacterEval evaluates RPCAs with thirteen specific metrics grouped into four dimensions:
  - Conversational Ability
  - Character Consistency
  - Role-playing Attractiveness
  - Personality Back-Testing
- **Developing a role-playing reward model (CharacterRM)**: A role-playing reward model has been developed based on human annotations, and its scores correlate more closely with human judgment than GPT-4's do.
- **Extensive experimental evaluation**: A comprehensive evaluation has been carried out on a variety of existing LLMs, including open-source and closed-source models, verifying the effectiveness and applicability of CharacterEval.

Through these contributions, CharacterEval aims to promote the research and development of role-playing conversational agents and to provide a more reliable and comprehensive evaluation tool.
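
The claim that a reward model tracks human judgment more closely than GPT-4 rests on comparing correlation coefficients. The summary does not say which coefficient was used, so the sketch below assumes Pearson correlation purely for illustration. The function name `pearson`, the variable names, and all score values are hypothetical and do not come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: Pearson is used only as an illustration; the source
    summary does not name the correlation measure.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five responses (made-up numbers, not paper data).
human_scores = [1, 2, 3, 4, 5]
judge_a_scores = [1.2, 1.9, 3.1, 4.2, 4.8]  # stand-in for one automatic judge
judge_b_scores = [3, 1, 4, 2, 5]            # stand-in for another judge

corr_a = pearson(human_scores, judge_a_scores)
corr_b = pearson(human_scores, judge_b_scores)

# The judge with the higher coefficient agrees more with the annotators.
print(f"judge A: {corr_a:.3f}, judge B: {corr_b:.3f}")
```

Under this reading, an automatic judge is preferred when its scores have a higher correlation with the human annotations over the same set of responses.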