Abstract: Effective evaluation methods remain a significant challenge for research on open-domain conversational dialogue systems. Explicit satisfaction ratings can be elicited from users, but users often do not provide ratings when asked, and those they give can be highly subjective. Post-hoc ratings by experts are an alternative, but these can be both expensive and complex to collect. Here, we explore the creation of automated methods for predicting both expert and user ratings of open-domain dialogues. We compare four different approaches. First, we train a baseline model using an end-to-end transformer to predict ratings directly from the raw dialogue text. The other three methods are variants of a two-stage approach in which we first extract interpretable features at the turn level that capture, among other aspects, user dialogue behaviors indicating contradiction, repetition, disinterest, compliments, or criticism. We project these features to the dialogue level and train a dialogue-level MLP regression model, a dialogue-level LSTM, and a novel causal inference model called counterfactual-LSTM (CF-LSTM) to predict ratings. The proposed CF-LSTM is a sequential model over turn-level features which predicts ratings using multiple regressors depending on hypotheses derived from the turn-level features. As a causal inference model, CF-LSTM aims to learn the underlying causes of a specific event, such as a low rating. We also bin the user ratings and perform classification experiments with all four models. In evaluation experiments on conversational data from the Alexa Prize SocialBot, we show that the CF-LSTM achieves the best performance on both dialogue rating prediction and rating classification.
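The two data-preparation steps named in the abstract, projecting turn-level features to the dialogue level and binning ratings for classification, might look roughly like the sketch below. The abstract does not specify either procedure, so everything here is an assumption made for illustration: the feature names (borrowed from the listed user behaviors), the mean pooling, the bin edges, and the helper names `project_to_dialogue` and `bin_rating`.

```python
from statistics import mean

# Hypothetical turn-level indicators, named after the behaviors the abstract lists.
FEATURES = ["contradiction", "repetition", "disinterest", "compliment", "criticism"]


def project_to_dialogue(turns):
    """Average each turn-level indicator over the dialogue.

    Mean pooling is an assumption; the abstract does not say how
    turn-level features are projected to the dialogue level.
    """
    if not turns:
        raise ValueError("dialogue has no turns")
    return [mean(turn.get(name, 0) for turn in turns) for name in FEATURES]


def bin_rating(rating, edges=(2.5, 3.5)):
    """Map a numeric rating to a class index (0, 1, or 2 with the default edges).

    The edges are illustrative only, not the bins used in the paper.
    """
    return sum(rating > edge for edge in edges)


# A toy four-turn dialogue; each dict holds the indicators that fired on that turn.
dialogue = [
    {"repetition": 1},
    {"criticism": 1, "disinterest": 1},
    {},
    {"compliment": 1},
]

vector = project_to_dialogue(dialogue)  # one fixed-length vector per dialogue
label = bin_rating(2.0)                 # class index for a low rating
```

A fixed-length vector of this kind could feed a dialogue-level regressor such as an MLP, whereas the sequential models described above would consume the per-turn features in order.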

Improving Open-Domain Dialogue Evaluation with a Causal Inference Model