Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles for task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that used for SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation of the generated responses using objective evaluation methods (e.g., BLEU and ROUGE), subjective evaluations, and LLM-based metrics. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics show a higher correlation with human evaluation than traditional metrics do. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.

SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

FFAEval: Evaluating Dialogue System Via Free-For-All Ranking

DynaEval: Unifying Turn and Dialogue Level Evaluation

Enhancing the Open-Domain Dialogue Evaluation in Latent Space

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation

How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning

xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark

A Self-Attention Joint Model for Spoken Language Understanding in Situational Dialog Applications

Learning Dialogue Representations from Consecutive Utterances

SELU: Self-Learning Embodied MLLMs in Unknown Environments

FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows

How to Evaluate Your Dialogue Models: A Review of Approaches

FCM: A Fine-grained Comparison Model for Multi-turn Dialogue Reasoning

MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue

FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models

Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation

GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems