Abstract: Replying with a related, fluent, and informative response is an indispensable requirement for building high-quality conversational agents. Several approaches have been proposed to generate better responses: feeding extra information by collecting large-scale datasets with human annotations, designing neural conversational models (NCMs) with complex architectures and loss functions, or filtering out untrustworthy samples based on a dialogue attribute, e.g., Relatedness or Genericness. In this paper, we follow the third research branch and present a data filtering method for open-domain dialogues, which identifies untrustworthy samples in the training data with a quality measure that linearly combines seven dialogue attributes. The attribute weights are obtained via Bayesian Optimization (BayesOpt), which iteratively optimizes an objective function for dialogue generation on the validation set. We then score training samples with the quality measure, sort them in descending order, and filter out those at the bottom. Furthermore, to accelerate the "filter-train-evaluate" iterations involved in BayesOpt on large-scale datasets, we propose a training framework that integrates maximum likelihood estimation (MLE) and negative training (NEG). The training method updates the parameters of a trained NCM on two small sets containing newly maintained and newly removed samples, respectively. Specifically, MLE is applied to maximize the log-likelihood of newly maintained samples, while NEG is used to minimize the log-likelihood of newly removed ones. Experimental results on two datasets show that our method can effectively identify untrustworthy samples, and that NCMs trained on the filtered datasets achieve better performance.
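
The scoring and filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-attribute data (the paper uses seven attributes), the hand-picked weights (the paper obtains them via BayesOpt), and the drop fraction are all assumptions made for illustration. The sketch scores each sample by a weighted sum of its attributes, keeps the top-ranked samples, and computes the "newly maintained" and "newly removed" sets between two consecutive filtering rounds, which are the sets the MLE and NEG updates would operate on.

```python
def quality_scores(attributes, weights):
    """Score each sample as a linear combination of its attribute values."""
    return [sum(w * a for w, a in zip(weights, row)) for row in attributes]


def keep_top(scores, drop_fraction):
    """Sort samples by score (descending) and drop the bottom fraction.

    Returns the set of indices of the samples that are kept.
    """
    n_drop = int(len(scores) * drop_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[: len(scores) - n_drop])


def delta_sets(prev_kept, new_kept):
    """Compare two filtering rounds.

    newly_maintained: kept now but filtered out before (candidates for MLE).
    newly_removed:    filtered out now but kept before (candidates for NEG).
    """
    return new_kept - prev_kept, prev_kept - new_kept


# Toy data (assumption): 4 samples, 2 attributes each.
attributes = [
    [0.9, 0.80],
    [0.2, 0.10],
    [0.5, 0.50],
    [0.7, 0.05],
]

# Two hypothetical weight vectors, standing in for successive BayesOpt proposals.
kept_round_1 = keep_top(quality_scores(attributes, [1.0, 0.0]), drop_fraction=0.5)
kept_round_2 = keep_top(quality_scores(attributes, [0.0, 1.0]), drop_fraction=0.5)

newly_maintained, newly_removed = delta_sets(kept_round_1, kept_round_2)
print(sorted(kept_round_1), sorted(kept_round_2))
print(sorted(newly_maintained), sorted(newly_removed))
```

Because only the two difference sets change between rounds, an incremental update touches far fewer samples than retraining on the full filtered dataset, which is the motivation the abstract gives for combining MLE and NEG.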

The Lab vs The Crowd: An Investigation into Data Quality for Neural Dialogue Models

DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit

Improving Dialogue Management: Quality Datasets vs Models

Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Real User Evaluation of Spoken Dialogue Systems Using Amazon Mechanical Turk

Effects of Naturalistic Variation in Goal-Oriented Dialog

What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare

Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection

Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

Data Quality in Crowdsourcing and Spamming Behavior Detection

Data-Driven Dialogue Systems for Social Agents

Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs

What Went Wrong? Explaining Overall Dialogue Quality through Utterance-Level Impacts

Leveraging LLMs for Dialogue Quality Measurement

Identifying Untrustworthy Samples: Data Filtering for Open-domain Dialogues with Bayesian Optimization

When Crowd Meets Persona: Creating a Large-Scale Open-Domain Persona Dialogue Corpus

Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations

Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs

"How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations

Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents