Abstract: Building a reliable and automated evaluation metric is a necessary but challenging problem for open-domain dialogue systems. Recent studies proposed evaluation metrics that assess generated responses by considering their relevance to previous dialogue histories. Although effective, these metrics evaluate individual responses directly rather than considering their relative quality compared to other responses. To handle this, we propose PairEval, a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations. PairEval is built on top of open-sourced and moderate-size language models, and we make them specialized in pairwise comparison between dialogue responses. Extensive experiments on multiple benchmarks demonstrate that our metric exhibits a higher correlation with human judgments than baseline metrics. We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems, including repetition and speaker insensitivity.

What problem does this paper attempt to address?

This paper addresses the problem of building a reliable, automated evaluation metric for open-domain dialogue systems. Existing metrics can assess how relevant a generated response is to the dialogue history, but they score each response in isolation rather than relative to other responses. As a result, they may fail to reflect human judgments of dialogue quality in some cases.

To overcome this, the paper proposes **PairEval**, a dialogue evaluation metric based on pairwise comparison. PairEval assesses a response by comparing its quality against responses from different dialogues. This improves correlation with human judgment and strengthens the detection of common failures in open-domain dialogue systems, such as repetition and speaker insensitivity. Specifically, PairEval takes a moderate-size open-source language model and fine-tunes it on a public dialogue corpus so that it can perform pairwise comparison. Experimental results show that PairEval correlates more strongly with human judgments than baseline metrics on multiple benchmarks, and in some cases even outperforms metrics that rely on powerful proprietary language models.

### Main Contributions

1. **Propose PairEval**: A dialogue evaluation metric based on pairwise comparison that evaluates the response quality of open-domain dialogue systems more accurately.
2. **Improve Correlation**: PairEval shows a higher correlation with human judgment on multiple benchmarks.
3. **Enhance Robustness**: PairEval is more robust in detecting common failures of dialogue systems.

### Method Overview

1. **Task Definition**: Given a dialogue history and a generated response, the evaluation metric must determine whether the response is suitable as the next turn in the dialogue.
2. **Pairwise Comparison**: The final evaluation score is computed by comparing the quality of the target response with a small number of comparison responses.
3. **Model Training**: Synthetic training samples are used to fine-tune the language model so that it can perform pairwise comparison.

### Experimental Results

PairEval correlates more strongly with human judgments than baseline metrics across multiple benchmarks. It also performs well in detecting common failures of dialogue systems.

### Conclusion

By introducing pairwise comparison, PairEval improves the accuracy and reliability of dialogue evaluation and provides a new, effective tool for evaluating open-domain dialogue systems.
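The pairwise comparison step in the method overview can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the fine-tuned language model is available as a comparator function that returns the probability that one response is better than another, and it assumes the final score is the mean of those probabilities over a few comparison examples; the summary does not state the aggregation rule. The names `pairwise_score` and `toy_prefer` are hypothetical, and `toy_prefer` is a word-overlap stand-in used only to make the example runnable.

```python
from typing import Callable, Sequence, Tuple

# One example = (dialogue history as a list of utterances, candidate response).
Example = Tuple[Sequence[str], str]

# Hypothetical comparator interface: probability in [0, 1] that the response
# in the first example is better than the response in the second example.
Comparator = Callable[[Example, Example], float]


def pairwise_score(target: Example,
                   comparisons: Sequence[Example],
                   prefer: Comparator) -> float:
    """Score `target` as its mean preference probability over `comparisons`.

    Assumption: a simple mean is used for aggregation; the source text only
    says the score comes from comparing against a few other responses.
    """
    if not comparisons:
        raise ValueError("at least one comparison example is required")
    return sum(prefer(target, other) for other in comparisons) / len(comparisons)


def _overlap(example: Example) -> float:
    """Fraction of response words that also appear in the dialogue history."""
    history, response = example
    history_words = {w.lower() for utterance in history for w in utterance.split()}
    response_words = [w.lower() for w in response.split()]
    if not response_words:
        return 0.0
    return sum(w in history_words for w in response_words) / len(response_words)


def toy_prefer(a: Example, b: Example) -> float:
    """Stand-in comparator for demonstration only (NOT a language model).

    Prefers the example whose response overlaps more with its own history.
    """
    score_a, score_b = _overlap(a), _overlap(b)
    if score_a == score_b:
        return 0.5
    return 1.0 if score_a > score_b else 0.0


if __name__ == "__main__":
    target = (["do you like jazz music"], "i like jazz a lot")
    others = [
        (["how is the weather today"], "the weather is sunny"),
        (["where are you from"], "pizza"),
    ]
    print(pairwise_score(target, others, toy_prefer))
```

In practice `toy_prefer` would be replaced by a call to the fine-tuned comparison model, while `pairwise_score` itself would stay unchanged.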

PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems

REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation

MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue

An Evaluation Protocol for Generative Conversational Systems

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Evaluating Dialogue Generation Systems via Response Selection

ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Uncertainty-aware Automatic Evaluation Method for Open-domain Dialogue Systems

Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Enhancing the Open-Domain Dialogue Evaluation in Latent Space

On Evaluating and Comparing Open Domain Dialog Systems

Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

BotEval: Facilitating Interactive Human Evaluation

CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems

DynaEval: Unifying Turn and Dialogue Level Evaluation