ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

Haiquan Zhao, Lingyu Li, Shisong Chen, Shuqi Kong, Jiaan Wang, Kexin Huang, Tianle Gu, Yixu Wang, Wang Jian, Dandan Liang, Zhixu Li, Yan Teng, Yanghua Xiao, Yingchun Wang
2024-10-28
Abstract: Emotion Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, how to evaluate these LLM-based ESC models remains unclear. Inspired by the impressive development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a dedicated role-playing model called ESC-Role, which behaves more like a confused person than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on the interactive multi-turn dialogues of the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but a gap to human performance remains. Moreover, to automate the scoring process for future ESC models, we develop ESC-RANK, which is trained on the annotated data and achieves scoring performance 35 points above that of GPT-4. Our data and code are available at https://github.com/AIFlames/Esc-Eval.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the evaluation of Large Language Models (LLMs) in Emotion Support Conversation (ESC). Although many researchers have used LLMs as ESC models, how to evaluate these models remains unclear. Existing evaluation methods rely mainly on text-based statistical metrics or manual evaluation, which are limited in objectivity and in their ability to assess multi-turn dialogue.

To address these problems, the authors propose a new evaluation framework, ESC-Eval. In this framework, a role-playing agent interacts with the ESC model, and the resulting multi-turn dialogues are evaluated manually, giving a more comprehensive and efficient evaluation method. Additionally, to automate the evaluation of future ESC models, the authors developed a scoring model, ESC-RANK, which surpasses GPT-4 by 35 points in accuracy.

### Main Contributions

1. **Proposing the ESC-Eval Framework**: This is the first framework to evaluate LLM-based ESC models through role-playing. It includes 2,801 user cards with fine-grained information, a dedicated role-playing model (ESC-Role), and seven carefully designed evaluation dimensions.
2. **Evaluating 14 LLMs**: Using the ESC-Eval framework, the authors tested 14 LLMs and manually annotated the resulting dialogues along the evaluation dimensions. The results highlight the need for ESC models that both align better with human preferences and have strong emotional support knowledge.
3. **Developing ESC-RANK**: To automate the evaluation of future ESC models, the authors developed the scoring model ESC-RANK, which outperforms GPT-4 by 35 points in accuracy.

### Method Overview

1. **Role Card Acquisition**:
   - Constructed a three-level classification system with 37 categories.
   - Extracted role cards from 7 open-source datasets and ensured quality through GPT-4 and manual filtering.
   - Obtained 2,801 high-quality role cards through crowdsourced annotation and manual correction.
2. **ESC-Role Model Training**:
   - Collected ESC scenario data from the Smile, ESConv, and ExTES datasets.
   - Extracted and filtered role cards with GPT-4, ultimately obtaining 14K items of role-playing data.
   - Developed the ESC-Role model through parameter-efficient LoRA fine-tuning of the Qwen1.5-14B-Chat model.
3. **Evaluation Results**:
   - Validated the effectiveness of the ESC-Role model through manual evaluation and pairwise comparison.
   - Evaluation results show that ESC-Role performs well across multiple dimensions, especially on the metrics specific to emotional support.
4. **ESC-RANK Model**:
   - Trained the ESC-RANK model using manually annotated data.
   - Experimental results indicate that ESC-RANK significantly outperforms GPT-4 in accuracy.

### Conclusion

By proposing the ESC-Eval framework and the ESC-RANK model, this paper provides a new, more comprehensive method for evaluating LLM-based ESC models. These methods improve the objectivity and efficiency of evaluation and lay the groundwork for future automated evaluation.
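
The interaction step at the core of ESC-Eval, in which a role-playing agent conditioned on a role card converses with the ESC model under test, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `RoleCard` fields, the prompt wording, the `run_dialogue` helper, the turn limit, and the stub responders are all assumptions made for illustration. In practice the two callables would wrap a role-playing model such as ESC-Role and the ESC model being evaluated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "seeker" | "supporter", "content": "..."}
# A responder maps (system prompt, dialogue history) to the next utterance.
Responder = Callable[[str, List[Message]], str]


@dataclass
class RoleCard:
    """Hypothetical role-card shape; the real cards carry finer-grained fields."""
    category: str
    description: str


def run_dialogue(card: RoleCard, seeker: Responder, supporter: Responder,
                 max_turns: int = 5) -> List[Message]:
    """Alternate seeker and supporter utterances for a fixed number of turns."""
    # Prompt wording is an assumption, not taken from the paper.
    seeker_prompt = (
        f"Play a person seeking emotional support. "
        f"Category: {card.category}. Situation: {card.description}"
    )
    supporter_prompt = "You are an emotional support assistant."
    history: List[Message] = []
    for _ in range(max_turns):
        # The role-playing agent speaks first, then the ESC model replies.
        history.append({"role": "seeker",
                        "content": seeker(seeker_prompt, history)})
        history.append({"role": "supporter",
                        "content": supporter(supporter_prompt, history)})
    return history


# Stub responders standing in for real model calls (assumptions for the demo).
def stub_seeker(prompt: str, history: List[Message]) -> str:
    turn = len(history) // 2 + 1
    return f"(seeker turn {turn}) I still feel stressed."


def stub_supporter(prompt: str, history: List[Message]) -> str:
    return "That sounds hard. Can you tell me more?"


if __name__ == "__main__":
    card = RoleCard(category="work stress",
                    description="Overwhelmed by deadlines.")
    for message in run_dialogue(card, stub_seeker, stub_supporter, max_turns=3):
        print(f'{message["role"]}: {message["content"]}')
```

The resulting `history` list is the multi-turn dialogue that would then be passed to human annotators or to an automatic scorer such as ESC-RANK.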