Abstract: In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting, such signals are usually unavailable due to the nature of the interactions; instead, evaluation often relies on crowdsourced labels. The role of user feedback in annotators' assessment of turns in a conversational setting has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate that the ratings assigned by both annotator groups differ distinctly between the two setups, showing that user feedback does influence system evaluation. Crowdworkers are more susceptible to user feedback on usefulness and interestingness, whereas LLMs are more susceptible on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by crowdworkers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.

Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation

How to Evaluate Your Dialogue Models: A Review of Approaches

Joint System-Wise Optimization for Pipeline Goal-Oriented Dialog System

Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems

An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue Systems

Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning

Turn-level Dialog Evaluation with Dialog-level Weak Signals for Bot-Human Hybrid Customer Service Systems

CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation

Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs

Many Hands Make Light Work: Task-Oriented Dialogue System with Module-Based Mixture-of-Experts

Recent Advances and Challenges in Task-Oriented Dialog Systems

Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations

On Evaluating and Comparing Open Domain Dialog Systems

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations

Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Multi-dimensional Evaluation of Empathetic Dialog Responses