Abstract: Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at

What problem does this paper attempt to address?

The main problem this paper addresses is the lack of a comprehensive benchmark for evaluating role-playing conversational agents (RPCAs). Existing datasets are either generated by large language models (LLMs) and suffer from quality problems, or are extracted from existing texts and contain a large amount of noise. In both cases the evaluation results are unreliable. To address this, the authors propose CharacterEval, a benchmark for evaluating Chinese role-playing conversational agents.

### Summary of Main Problems:

1. **Lack of a comprehensive evaluation benchmark**: Existing evaluation methods and datasets cannot comprehensively and accurately evaluate the performance of role-playing conversational agents.
2. **Data quality problems**: The quality of existing datasets varies greatly. Many are generated by LLMs, or pick up a large amount of noise when extracted from texts.
3. **Insufficient evaluation dimensions**: A multi-dimensional evaluation framework is required to measure RPCA performance comprehensively, covering conversational ability, character consistency, role-playing attractiveness, and personality back-testing.

### Specific Contributions of CharacterEval:

- **Constructing a high-quality dataset**: CharacterEval contains 1,785 multi-turn role-playing conversations, covering 11,376 examples and 77 main characters from multiple Chinese novels and scripts.
- **Proposing a multi-dimensional evaluation framework**: CharacterEval defines thirteen specific metrics across four dimensions, supporting a comprehensive evaluation of RPCAs:
  - Conversational Ability
  - Character Consistency
  - Role-playing Attractiveness
  - Personality Back-Testing
- **Developing a role-playing reward model (CharacterRM)**: A role-playing reward model was developed from human annotations, and it correlates more closely with human judgment than GPT-4 does.
- **Extensive experimental evaluation**: A comprehensive evaluation was carried out on a variety of existing LLMs, both open-source and closed-source, verifying the effectiveness and applicability of CharacterEval.

Through these contributions, CharacterEval aims to promote the research and development of role-playing conversational agents and to provide a more reliable and comprehensive evaluation tool.
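The claim that CharacterRM agrees with human judgment better than GPT-4 is a statement about correlation between two lists of scores. The following is a minimal sketch of how such a comparison could be computed. It assumes the Pearson coefficient (the summary does not say which correlation measure was used), and the function name `pearson`, the variable names, and all score values are made up for illustration; none of them come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five responses on one metric (illustrative only).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]    # human annotations
judge_a_scores = [1.5, 2.0, 3.5, 4.0, 4.5]  # e.g. a trained reward model
judge_b_scores = [3.0, 1.0, 4.0, 2.0, 5.0]  # e.g. a general-purpose LLM judge

print("judge A vs human:", pearson(human_scores, judge_a_scores))
print("judge B vs human:", pearson(human_scores, judge_b_scores))
```

The judge whose scores yield the higher coefficient tracks the human annotations more closely on that metric. In practice the comparison would be repeated per metric over the annotated examples.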

CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models

SocialBench: Sociality Evaluation of Role-Playing Conversational Agents

RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models

A Multi-Task Role-Playing Agent Capable of Imitating Character Linguistic Styles

InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews

Capturing Minds, Not Just Words: Enhancing Role-Playing Language Models with Personality-Indicative Data

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Orca: Enhancing Role-Playing Abilities of Large Language Models by Integrating Personality Traits

MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents

Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works

ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning

Characteristic AI Agents via Large Language Models

Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment

Character-LLM: A Trainable Agent for Role-Playing

RoleCraft-GLM: Advancing Personalized Role-Playing in Large Language Models

ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

SimulBench: Evaluating Language Models with Creative Simulation Tasks

Neeko: Leveraging Dynamic LoRA for Efficient Multi-Character Role-Playing Agent

BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model