Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which require open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that used for SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation of the generated responses using objective evaluation methods (e.g., BLEU and ROUGE), subjective evaluations, and LLM-based metrics. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics correlate more strongly with human evaluation than traditional metrics do. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.

Enhancing the Open-Domain Dialogue Evaluation in Latent Space

Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation

SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Multi-dimensional Evaluation of Empathetic Dialog Responses

Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

FFAEval: Evaluating Dialogue System Via Free-For-All Ranking

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems

Improving Open-Domain Dialogue Evaluation with a Causal Inference Model

xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark

MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation

REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation

Uncertainty-aware Automatic Evaluation Method for Open-domain Dialogue Systems

Large Language Model based Situational Dialogues for Second Language Learning

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison

Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs