Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short in generating appropriate responses. We argue that this is due to the lack of principles for task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that used for SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation of the generated responses using objective evaluation methods (e.g., BLEU and ROUGE), subjective evaluations, and LLM-based metrics. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics show a higher correlation with human evaluation than traditional metrics do. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

MLS: A Large-Scale Multilingual Dataset for Speech Research

LLaSM: Large Language and Speech Model

The Massively Multilingual Natural Language Understanding 2022 (MMNLU-22) Workshop and Competition

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

SpeechVerse: A Large-scale Generalizable Audio Language Model

The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

Scaling Speech Technology to 1,000+ Languages

MaLA-500: Massive Language Adaptation of Large Language Models

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models

AudioPaLM: A Large Language Model That Can Speak and Listen

Towards Robust Speech Representation Learning for Thousands of Languages

Speech Wikimedia: A 77 Language Multilingual Speech Dataset

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners