A Comprehensive Analysis of the Effectiveness of Large Language Models As Automatic Dialogue Evaluators

Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li
DOI: https://doi.org/10.1609/aaai.v38i17.29923
2024-01-01
Abstract: Automatic evaluation is an integral aspect of dialogue system research. The traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have suggested various unique, reference-free neural metrics that better align with human evaluations. Notably among them, large language models (LLMs), particularly the instruction-tuned variants like ChatGPT, are shown to be promising substitutes for human judges. Yet, existing works on utilizing LLMs for automatic dialogue evaluation are limited in their scope in terms of the number of meta-evaluation datasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains inconclusive how effective these LLMs are. To this end, we conduct a comprehensive study on the application of LLMs for automatic dialogue evaluation. Specifically, we analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels, using a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels. Finally, we explore how model-level and dimension-level ensembles impact the evaluation performance. All resources are available at https://github.com/e0397123/comp-analysis.