Abstract: User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement by using heuristically constructed features such as the number of turns and the total time of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators have high agreement on assessing utterance-level engagement scores; (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that the utterance-level engagement scores can be learned from data. These scores can improve automatic evaluation metrics for open-domain dialogue systems, as shown by correlation with human judgements. This suggests that predictive engagement can be used as real-time feedback for training better dialogue models.

What problem does this paper attempt to address?

This paper addresses the automatic evaluation of open-domain dialogue systems. Specifically, it focuses on user engagement, a key indicator of the quality of such systems. Prior methods mainly evaluate engagement at the conversation level, using heuristically constructed features such as the number of conversation turns and the total conversation time. These methods have limitations: for example, they cannot provide real-time feedback and adapt poorly to different domains. The authors therefore propose a new metric, predictive engagement, which estimates engagement at the level of individual utterances and uses it for the automatic evaluation of open-domain dialogue systems.

The main contributions of the paper are:

1. **Demonstrating the Feasibility of Utterance-level Engagement**: Experiments show that human annotators reach a high degree of agreement when rating the engagement of individual query-response pairs. This indicates that utterance-level engagement scores can be used to evaluate dialogue systems in real time, not only after the conversation has ended. These scores can also be used to improve the training of dialogue models.
2. **Studying the Relationship between Utterance-level and Conversation-level Engagement**: Conversation-level engagement scores correlate highly with the aggregated engagement scores of the individual utterances. This means that a conversation-level engagement score can be assigned to all utterances in the same conversation, so that existing conversation-level engagement resources can be used to learn utterance-level engagement scores.
3. **Proposing a Transfer Learning Framework**: Existing conversation-level engagement annotations, combined with a small amount of additional manually annotated data, are used to build an accurate utterance-level engagement scorer.
4. **Improving the Accuracy of Automatic Evaluation Systems**: Incorporating the predicted utterance-level engagement scores into existing automatic evaluation metrics significantly improves the correlation between these metrics and human judgments.

Through these contributions, the paper aims to provide a more effective and accurate method for evaluating open-domain dialogue systems, one that better reflects the actual experience and preferences of users.
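
The relationship between utterance-level and conversation-level engagement described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the mean is only one possible aggregation, and the function names (`aggregate_engagement`, `pearson`), the score ranges, and the toy data are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def aggregate_engagement(utterance_scores):
    """Aggregate utterance-level scores into one conversation-level estimate.

    Assumption: a plain mean is used here; the summary above only says the
    scores are "properly aggregated" and does not fix the method.
    """
    if not utterance_scores:
        raise ValueError("a conversation needs at least one utterance score")
    return mean(utterance_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: predicted utterance-level scores in [0, 1] and a
# human conversation-level rating on a 1-5 scale (both scales are assumed).
conversations = [
    {"utterance_scores": [0.9, 0.8, 0.7], "conversation_score": 4.5},
    {"utterance_scores": [0.2, 0.4, 0.3], "conversation_score": 2.0},
    {"utterance_scores": [0.6, 0.5, 0.7], "conversation_score": 3.5},
    {"utterance_scores": [0.1, 0.2], "conversation_score": 1.0},
]

aggregated = [aggregate_engagement(c["utterance_scores"]) for c in conversations]
human = [c["conversation_score"] for c in conversations]
r = pearson(aggregated, human)
print(f"Pearson r between aggregated and human scores: {r:.3f}")
```

A high correlation on real annotated data is what would justify propagating conversation-level labels down to utterances; on this toy data the value is high only because the numbers were chosen that way.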

Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

Predicting User Engagement Status for Online Evaluation of Intelligent Assistants

User Response and Sentiment Prediction for Automatic Dialogue Evaluation

REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Improving Open-Domain Dialogue Evaluation with a Causal Inference Model

Enhancing the Open-Domain Dialogue Evaluation in Latent Space

DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

A multimodal approach for modeling engagement in conversation

Uncertainty-aware Automatic Evaluation Method for Open-domain Dialogue Systems

Offline and Online Satisfaction Prediction in Open-Domain Conversational Systems

On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems

EMP-EVAL: A Framework for Measuring Empathy in Open Domain Dialogues

FFAEval: Evaluating Dialogue System Via Free-For-All Ranking

Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation

Approximating Online Human Evaluation of Social Chatbots with Prompting

Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems