Abstract: Recent advancements in reference-free learned metrics for open-domain dialogue evaluation have been driven by progress in pre-trained language models and the availability of dialogue data with high-quality human annotations. However, current studies predominantly concentrate on English dialogues, and the generalization of these metrics to other languages has not been fully examined. This is largely due to the absence of a multilingual dialogue evaluation benchmark. To address the issue, we introduce xDial-Eval, built on top of open-source English dialogue evaluation datasets. xDial-Eval includes 12 turn-level and 6 dialogue-level English datasets, comprising 14930 annotated turns and 8691 annotated dialogues respectively. The English dialogue data are extended to nine other languages with commercial machine translation systems. On xDial-Eval, we conduct comprehensive analyses of previous BERT-based metrics and recently emerged large language models. Lastly, we establish strong self-supervised and multilingual baselines. In terms of average Pearson correlation over all datasets and languages, the best baseline outperforms OpenAI's ChatGPT by absolute improvements of 6.5% and 4.6% at the turn and dialogue levels respectively, despite having far fewer parameters. The data and code are publicly available at https://github.com/e0397123/xDial-Eval.
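
The headline comparison above rests on Pearson correlation between a metric's scores and human annotations, averaged over all datasets and languages. The following is a minimal sketch of that aggregation, not the authors' evaluation code. The function names `pearson` and `average_correlation`, the dictionary layout keyed by `(dataset, language)`, and the toy scores are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_correlation(results):
    """Unweighted mean of per-(dataset, language) Pearson correlations.

    `results` maps a hypothetical (dataset, language) key to a pair
    (metric_scores, human_scores). This layout is an assumption for
    illustration, not the benchmark's actual file format.
    """
    return mean(pearson(metric, human) for metric, human in results.values())


# Toy, made-up scores purely to exercise the functions.
toy_results = {
    ("dataset_a", "en"): ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    ("dataset_a", "zh"): ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
}
toy_average = average_correlation(toy_results)
```

Each (dataset, language) pair contributes equally to the mean in this sketch. Whether the benchmark weights pairs by size or averages in a different order is not stated in the abstract, so the unweighted mean here is an assumption.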

DynaEval: Unifying Turn and Dialogue Level Evaluation

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems

Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs

How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning

xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark

FFAEval: Evaluating Dialogue System Via Free-For-All Ranking

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Multi-dimensional Evaluation of Empathetic Dialog Responses

Leveraging LLMs for Dialogue Quality Measurement

FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows

PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison

Interaction Matters: An Evaluation Framework for Interactive Dialogue Assessment on English Second Language Conversations

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

DialSummEval: Revisiting Summarization Evaluation for Dialogues

Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores

DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs

On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems

Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation