Abstract: With the increasing popularity of conversational search, how to evaluate the performance of conversational search systems has become an important question in the IR community. Existing work on conversational search evaluation falls mainly into two streams: (1) constructing metrics based on semantic similarity (e.g., BLEU, METEOR and BERTScore), or (2) directly evaluating the system's response ranking performance using traditional search metrics (e.g., nDCG, RBP and nERR). However, these methods ignore either the user's information need or the mixed-initiative property of conversational search. This raises the question of how to accurately model user satisfaction in conversational search scenarios. Since explicitly asking users to provide satisfaction feedback is difficult, traditional IR studies often rely on the Cranfield paradigm (i.e., third-party annotation) and user behavior modeling to estimate user satisfaction in search. However, the feasibility and effectiveness of these two approaches have not been fully explored in conversational search. In this paper, we examine the evaluation of conversational search from the perspective of user satisfaction. We build a novel conversational search experimental platform and construct a Chinese open-domain conversational search behavior dataset containing rich annotations and search behavior data. We also collect third-party satisfaction annotations at the session level and the turn level to investigate the feasibility of the Cranfield paradigm in the conversational search scenario. Experimental results show both some consistency and considerable differences between the user satisfaction annotations and the third-party annotations. We also propose dialog continuation or ending behavior models (DCEBM) to capture session-level user satisfaction based on turn-level information.

Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues

Towards Automatic Evaluation of Customer-Helpdesk Dialogues

Short Text Conversation Based on Deep Neural Network and Analysis on Evaluation Measures

Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems

CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation

The First Evaluation of Chinese Human-Computer Dialogue Technology

Towards Better Understanding of User Satisfaction in Open-Domain Conversational Search

An Evaluation of Chinese Human-Computer Dialogue Technology

Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs

ClarQ-LLM: A Benchmark for Models Clarifying and Requesting Information in Task-Oriented Dialog

CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

How to Evaluate Your Dialogue Models: A Review of Approaches

DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue Assessment

FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback

A Manually Annotated Chinese Corpus for Non-task-oriented Dialogue Systems

CNIMA: A Universal Evaluation Framework and Automated Approach for Assessing Second Language Dialogues

DialogQAE: N-to-N Question Answer Pair Extraction from Customer Service Chatlog

Matching Questions and Answers in Dialogues from Online Forums

Microsoft Dialogue Challenge: Building End-to-End Task-Completion Dialogue Systems

Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems