Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that used for SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation of the generated responses using objective evaluation methods (e.g., BLEU and ROUGE), subjective evaluations, and LLM-based metrics. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics correlate more strongly with human evaluation than traditional metrics do. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.

"How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations

Towards Generalized Models for Task-oriented Dialogue Modeling on Spoken Conversations

Dialogos: a robust system for human-machine spoken dialogue on the telephone

Adapting Text-based Dialogue State Tracker for Spoken Dialogues

Are LLMs Robust for Spoken Dialogues?

KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark

RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems

Robust Speech Recognition Directed by Extended Template Matching in Dialogue System

Improving Robustness of Task Oriented Dialog Systems

Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4

SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Action-Based Conversations Dataset: A Corpus for Building More In-Depth Task-Oriented Dialogue Systems

Robust Analysis And Interpretation Of Spoken Chinese Queries

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Key-Value Retrieval Networks for Task-Oriented Dialogue

Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

Many Hands Make Light Work: Task-Oriented Dialogue System with Module-Based Mixture-of-Experts

Using Deep-Q Network To Select Candidates From N-Best Speech Recognition Hypotheses For Enhancing Dialogue State Tracking

Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset

More Robust Schema-Guided Dialogue State Tracking via Tree-Based Paraphrase Ranking

Statistical Methods for Building Robust Spoken Dialogue Systems in an Automobile