Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

Sarik Ghazarian, Ralph Weischedel, Aram Galstyan, Nanyun Peng
DOI: https://doi.org/10.48550/arXiv.1911.01456
2020-01-25
Abstract: User engagement is a critical metric for evaluating the quality of open-domain dialogue systems. Prior work has focused on conversation-level engagement by using heuristically constructed features such as the number of turns and the total time of the conversation. In this paper, we investigate the possibility and efficacy of estimating utterance-level engagement and define a novel metric, predictive engagement, for automatic evaluation of open-domain dialogue systems. Our experiments demonstrate that (1) human annotators have high agreement on assessing utterance-level engagement scores; (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we show that the utterance-level engagement scores can be learned from data. These scores can improve automatic evaluation metrics for open-domain dialogue systems, as shown by correlation with human judgements. This suggests that predictive engagement can be used as real-time feedback for training better dialogue models.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the evaluation of open-domain dialogue systems, and in particular how to evaluate them effectively and automatically. It focuses on user engagement, a key indicator of the quality of open-domain dialogue systems. Traditional methods evaluate engagement mainly at the conversation level, using heuristically constructed features such as the number of conversation turns and the total conversation time. These methods have limitations: for example, they cannot provide real-time feedback and adapt poorly to different domains. The authors therefore propose a new metric, predictive engagement, which estimates engagement at the level of individual utterances and serves as a new method for the automatic evaluation of open-domain dialogue systems.

The main contributions of the paper include:

1. **Demonstrating the Feasibility of Utterance-level Engagement**: Experiments show that human annotators reach a high degree of consensus when scoring the engagement of individual query-response pairs. This indicates that utterance-level engagement scores can be used to evaluate dialogue systems in real time, not only after the conversation has ended. These scores can also be used to improve the training of dialogue models.
2. **Studying the Relationship between Utterance-level and Conversation-level Engagement**: Conversation-level engagement scores correlate highly with aggregated utterance-level engagement scores. This means that a conversation-level engagement score can be assigned to all utterances in the same conversation, so that existing conversation-level engagement resources can be used to learn utterance-level engagement scores.
3. **Proposing a Transfer Learning Framework**: Existing conversation-level engagement annotations, combined with a small amount of additional manually annotated data, are used to build an accurate utterance-level engagement scorer.
4. **Improving the Accuracy of Automatic Evaluation Systems**: Incorporating the predicted utterance-level engagement scores into existing automatic evaluation metrics significantly improves the correlation between these metrics and human judgments.

Through these contributions, the paper aims to provide a more effective and accurate method for evaluating open-domain dialogue systems, one that better reflects the actual experience and preferences of users.
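The relationship described in contributions 2 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not specify the aggregation function or how engagement is combined with other metrics, so the mean aggregation, the weighted-average combination, the `weight` parameter, and all function names and toy numbers below are assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean


def aggregate_engagement(utterance_scores):
    """Aggregate utterance-level scores into one conversation-level score.

    Assumption: a plain mean; the summary only says "properly aggregated".
    """
    if not utterance_scores:
        raise ValueError("need at least one utterance score")
    return mean(utterance_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def combine_with_engagement(base_metric, engagement, weight=0.5):
    """Blend an existing metric score with a predicted engagement score.

    Assumption: a weighted average with a hypothetical `weight` parameter.
    """
    return (1 - weight) * base_metric + weight * engagement


if __name__ == "__main__":
    # Made-up toy data: per-utterance engagement scores for three
    # conversations, and a human conversation-level score for each.
    utterance_level = [[0.1, 0.2, 0.3], [0.5, 0.6, 0.4], [0.9, 0.8, 1.0]]
    human_conversation_level = [1.0, 3.0, 5.0]

    aggregated = [aggregate_engagement(u) for u in utterance_level]
    print("aggregated:", aggregated)
    print("correlation:", pearson(aggregated, human_conversation_level))
    print("combined:", combine_with_engagement(0.8, 0.4))
```

A high correlation on real annotations would support assigning the conversation-level score to each utterance as a weak label, which is the premise of the transfer learning framework in contribution 3.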