LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation

Yi-Pei Chen, KuanChao Chu, Hideki Nakayama
2024-06-05
Abstract: This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for dialogue evaluation remains challenging due to model sensitivity and subjectivity in dialogue assessments. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a "reason-first" approach yielding more comprehensive evaluations. This insight is crucial for enhancing the accuracy and consistency of LLM-based evaluations.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the sensitivity of LLM-based dialogue evaluation to prompt design. Specifically, it examines how the order of output instructions affects the scores an LLM assigns, and whether asking the model for explanatory reasons improves evaluation quality. The authors experiment with different prompt structures, varying the sequence of output instructions and including explanatory reasons. They find that the order in which reasons and scores are presented significantly influences the scores, and that a "reason-first" approach, in which the model gives its reasoning before the score, yields more comprehensive evaluations. The authors regard this insight as crucial for improving the accuracy and consistency of LLM-based evaluations.
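To make the two output orders concrete, here is a minimal sketch of how such prompt variants might be built. It is not the authors' prompt: the function name `build_eval_prompt`, the instruction wording, and the 1 to 10 score scale are all assumptions made for illustration. The only thing that differs between the two variants is the order of the output fields.

```python
from typing import Literal

# Hypothetical names and wording; the paper's actual prompts are not reproduced here.
Order = Literal["reason_first", "score_first"]


def build_eval_prompt(dialogue: str, order: Order = "reason_first") -> str:
    """Build a dialogue-evaluation prompt whose output fields differ only in order."""
    fields = {
        "reason": "Reason: <a short explanation of your judgment>",
        # The 1-10 scale is an assumption, not taken from the source.
        "score": "Score: <an integer from 1 to 10>",
    }
    if order == "reason_first":
        sequence = ["reason", "score"]
    elif order == "score_first":
        sequence = ["score", "reason"]
    else:
        raise ValueError(f"unknown order: {order!r}")

    # Join the requested output fields in the chosen order.
    output_format = "\n".join(fields[name] for name in sequence)
    return (
        "Evaluate the quality of the following dialogue.\n\n"
        f"Dialogue:\n{dialogue}\n\n"
        "Respond in exactly this format:\n"
        f"{output_format}"
    )


if __name__ == "__main__":
    sample = "A: Hi, how are you?\nB: Fine, thanks. And you?"
    print(build_eval_prompt(sample, "reason_first"))
    print("---")
    print(build_eval_prompt(sample, "score_first"))
```

Keeping everything except the field order fixed is what lets a comparison attribute any difference in scores to the output order rather than to other changes in wording.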