Uncertainty-aware Automatic Evaluation Method for Open-domain Dialogue Systems

Yuma Tsuta, Naoki Yoshinaga, Masashi Toyoda
DOI: https://doi.org/10.5715/jnlp.30.531
2023-01-01
Journal of Natural Language Processing
Abstract: Because open-domain dialogues allow diverse responses, common reference-based metrics for text generation, such as BLEU, do not correlate well with human judgments unless we prepare an extensive reference set of high-quality responses for input utterances. In this study, we propose a fully automatic, uncertainty-aware evaluation method for open-domain dialogue systems, υBLEU. Our method first collects diverse reference responses from massive dialogue data, annotates their quality judgments by using a neural network trained on automatically collected training data, and then computes weighted BLEU using the automatically retrieved and rated reference responses. We also employ this method with an embedding-based metric, BERTScore, instead of the word-overlap-based metric, BLEU, to absorb surface variations of the reference responses. The experimental results on the meta-evaluation of our evaluation method for dialogue systems based on massive Twitter data confirmed that our method substantially improves correlations between BLEU (or BERTScore) and human judgments. We also confirmed that our method is effective when it is combined with a reference-free metric.
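The final step described in the abstract, scoring a response against retrieved references that each carry a quality rating, can be illustrated with a small sketch. This is not the paper's formula: the abstract does not give one, so the weighting scheme below (each hypothesis n-gram is credited with the best weight-times-clipped-count over all references), the omission of a brevity penalty, and the names `ngrams`, `weighted_precision`, and `weighted_bleu` are all assumptions made for illustration. The quality weights are taken as given numbers in [0, 1]; in the paper they would come from the trained neural rater.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def weighted_precision(hyp, refs, n):
    """Quality-weighted n-gram precision (illustrative assumption).

    refs is a list of (tokens, weight) pairs with weight in [0, 1].
    Each hypothesis n-gram earns the highest weight * clipped count
    found among the references.
    """
    hyp_counts = ngrams(hyp, n)
    if not hyp_counts:
        return 0.0
    ref_counts = [(ngrams(tokens, n), weight) for tokens, weight in refs]
    credit = 0.0
    for gram, count in hyp_counts.items():
        credit += max(
            (weight * min(count, counts[gram]) for counts, weight in ref_counts),
            default=0.0,
        )
    return credit / sum(hyp_counts.values())


def weighted_bleu(hyp, refs, max_n=2):
    """Geometric mean of weighted precisions; no brevity penalty (simplification)."""
    precisions = [weighted_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical usage: two retrieved references with made-up quality ratings.
references = [("i love it".split(), 0.5), ("i hate it".split(), 1.0)]
score = weighted_bleu("i love it".split(), references)
```

Under this scheme a match against a low-rated reference earns proportionally less credit than the same match against a high-rated one, which is the intuition behind weighting references by estimated quality.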