Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that used for SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation of the generated responses using objective evaluation methods (e.g., BLEU and ROUGE), subjective evaluations, and LLM-based metrics. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics show a higher correlation with human evaluation than traditional metrics do. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.

Overview of the NLPCC 2023 Shared Task 10: Learn to Watch TV: Multimodal Dialogue Understanding and Response Generation

Overview of the NLPCC 2022 Shared Task: Multi-modal Dialogue Understanding and Generation

JDDC 2.1: A Multimodal Chinese Dialogue Dataset with Joint Tasks of Query Rewriting, Response Generation, Discourse Parsing, and Summarization

Overview of the NLPCC 2022 Shared Task: Dialogue Text Analysis (DTA)

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems

Overview of the NLPCC 2018 Shared Task: Social Media User Modeling

Microsoft Dialogue Challenge: Building End-to-End Task-Completion Dialogue Systems

XDailyDialog: A Multilingual Parallel Dialogue Corpus

Deep Contextualized Utterance Representations for Response Selection and Dialogue Analysis

CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue

Towards Generalized Models for Task-oriented Dialogue Modeling on Spoken Conversations

MultiWOZ 2.3: A Multi-Domain Task-Oriented Dialogue Dataset Enhanced with Annotation Corrections and Co-Reference Annotation

CNIMA: A Universal Evaluation Framework and Automated Approach for Assessing Second Language Dialogues

Overview of the NLPCC 2017 Shared Task: Emotion Generation Challenge

PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas

Overview of the NLPCC 2015 Shared Task: Chinese Word Segmentation and POS Tagging for Micro-blog Texts

DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset