Abstract: Emotion Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, how to evaluate these LLM-based ESC models remains unclear. Inspired by the rapid development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a dedicated role-playing model called ESC-Role, which behaves more like a confused person than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on the interactive multi-turn dialogues of the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but they still fall short of human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which is trained on the annotated data and achieves a scoring performance that surpasses GPT-4 by 35 points. Our data and code are available at https://github.com/AIFlames/Esc-Eval.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the evaluation of Large Language Models (LLMs) in Emotion Support Conversation (ESC). Although many researchers have used LLMs as ESC models, how to evaluate these models remains unclear. Existing evaluation methods rely mainly on text-based statistical metrics or manual evaluation, which are limited in objectivity and in their ability to assess multi-turn dialogue.

To solve these problems, the authors propose a new evaluation framework, ESC-Eval. In this framework, a role-playing agent interacts with the ESC model, and the resulting multi-turn dialogues are evaluated manually, providing a more comprehensive and efficient evaluation method. Additionally, to automate the evaluation of future ESC models, the authors developed a scoring model, ESC-RANK, which surpasses GPT-4 by 35 points in accuracy.

### Main Contributions

1. **Proposing the ESC-Eval Framework**: This is the first framework to evaluate LLM-based ESC models through role-playing. It includes 2,801 user cards with fine-grained information, a dedicated role-playing model (ESC-Role), and seven carefully designed evaluation dimensions.
2. **Evaluating 14 LLMs**: Using the ESC-Eval framework, the authors tested 14 LLMs and manually annotated the dialogues along the evaluation dimensions. The results highlight the urgent need for ESC models that align better with human preference and have strong emotional support knowledge.
3. **Developing ESC-RANK**: To automate the evaluation of future ESC models, the authors developed the scoring model ESC-RANK, which outperforms GPT-4 by 35 points in accuracy.

### Method Overview

1. **Role Card Acquisition**:
   - Constructed a three-level classification system with 37 categories.
   - Extracted role cards from 7 open-source datasets and ensured quality through GPT-4 and manual filtering.
   - Obtained 2,801 high-quality role cards through crowdsourced annotation and manual correction.
2. **ESC-Role Model Training**:
   - Collected ESC scenario data from the Smile, ESConv, and ExTES datasets.
   - Extracted and filtered role cards with GPT-4, ultimately obtaining 14K items of role-playing data.
   - Developed the ESC-Role model through parameter-efficient LoRA fine-tuning of the Qwen1.5-14B-Chat model.
3. **Evaluation Results**:
   - Validated the effectiveness of the ESC-Role model through manual evaluation and pairwise comparison.
   - Evaluation results show that ESC-Role performs excellently across multiple dimensions, especially on metrics specific to emotional support.
4. **ESC-RANK Model**:
   - Trained the ESC-RANK model on the manually annotated data.
   - Experimental results indicate that ESC-RANK significantly outperforms GPT-4 in accuracy.

### Conclusion

By proposing the ESC-Eval framework and the ESC-RANK model, this paper provides a new, more comprehensive method for evaluating LLM-based ESC models. These methods improve the objectivity and efficiency of evaluation and also support future automated evaluation.
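
The interaction step at the core of the framework, in which a role-playing agent conditioned on a role card converses with the ESC model under test, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `RoleCard`, `simulate_dialogue`, the prompt strings, and the stub models are all hypothetical names and placeholders, and a real setup would replace the stubs with calls to ESC-Role and to the ESC model being evaluated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One dialogue message: {"speaker": "seeker" | "supporter", "text": "..."}
Message = Dict[str, str]

# Assumption: a "model" is any callable mapping (system prompt, history) -> reply.
Model = Callable[[str, List[Message]], str]


@dataclass
class RoleCard:
    """Hypothetical role card: a category plus a free-text seeker profile."""
    category: str
    description: str


def simulate_dialogue(card: RoleCard, seeker: Model, supporter: Model,
                      max_turns: int = 5) -> List[Message]:
    """Alternate seeker and supporter turns, returning the full transcript."""
    # Placeholder prompts; the prompts used in the paper are not reproduced here.
    seeker_prompt = f"You are seeking emotional support. Profile: {card.description}"
    supporter_prompt = "You are an emotional support assistant."
    history: List[Message] = []
    for _ in range(max_turns):
        # The role-playing agent speaks first, conditioned on its role card.
        history.append({"speaker": "seeker",
                        "text": seeker(seeker_prompt, history)})
        # The ESC model under evaluation replies to the conversation so far.
        history.append({"speaker": "supporter",
                        "text": supporter(supporter_prompt, history)})
    return history


# Stub models so the sketch runs without any LLM backend.
def stub_seeker(prompt: str, history: List[Message]) -> str:
    turn = len(history) // 2 + 1
    return f"(seeker turn {turn})"


def stub_supporter(prompt: str, history: List[Message]) -> str:
    return "I hear you: " + history[-1]["text"]


if __name__ == "__main__":
    card = RoleCard(category="work", description="stressed about a deadline")
    for message in simulate_dialogue(card, stub_seeker, stub_supporter, max_turns=3):
        print(f'{message["speaker"]}: {message["text"]}')
```

The transcript returned by such a loop is what would then be scored, either by human annotators along the evaluation dimensions or by an automatic scorer in the role that ESC-RANK plays.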

ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
