A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization

KuanChao Chu, Yi-Pei Chen, Hideki Nakayama
2024-06-14
Abstract: This research investigates prompt designs for evaluating generated texts using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for open-ended text evaluation remains challenging due to model sensitivity and the subjectivity of text generation evaluation. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a different level of rule understanding in the prompt. An additional optimization may enhance scoring alignment if sufficient data is available. This insight is crucial for improving the accuracy and consistency of LLM-based evaluations.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to design effective prompts when large language models (LLMs) are used to evaluate generated text. LLMs are increasingly used to score various inputs, but writing effective prompts for open-ended text evaluation remains difficult for two reasons: models are highly sensitive to slight changes in the input prompt, and the evaluation of generated text is inherently subjective. The study focuses on two design choices: the order of the output instructions, and whether the model is asked to give explanatory reasons alongside its score. The authors experiment with prompt structures that vary these choices in a dialogue evaluation setting and find that the order in which reasons and scores are presented significantly affects the scores the LLM assigns. They also report that, when sufficient data is available, an additional optimization step may improve scoring alignment. Overall, the paper aims to make LLM-based evaluation of generated text more accurate and consistent by showing how prompt design, and output ordering in particular, shapes the scoring results.
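To make the idea of output ordering concrete, the following is a minimal sketch of two prompt variants that differ only in whether the score or the reasons are requested first. It is not the authors' prompt: the function name build_prompt, the JSON output format, the key names, and the 1 to 10 score range are all assumptions chosen for illustration.

```python
import json

# Hypothetical placeholders for the two output fields (assumed, not from the paper).
FIELD_TEMPLATES = {
    "score": "<integer from 1 to 10>",
    "reasons": "<short explanation>",
}


def build_prompt(dialogue: str, order: str) -> str:
    """Build an evaluation prompt whose output instruction asks for
    the score first ("score_first") or the reasons first ("reasons_first")."""
    if order == "score_first":
        fields = ["score", "reasons"]
    elif order == "reasons_first":
        fields = ["reasons", "score"]
    else:
        raise ValueError(f"unknown order: {order!r}")

    # Dicts preserve insertion order, so the JSON keys appear in the requested order.
    output_spec = json.dumps({name: FIELD_TEMPLATES[name] for name in fields})
    return (
        "Evaluate the following dialogue.\n\n"
        f"{dialogue}\n\n"
        f"Reply with JSON using exactly this key order: {output_spec}"
    )


if __name__ == "__main__":
    sample = "A: Hi, how are you?\nB: Fine, thanks."
    for order in ("score_first", "reasons_first"):
        print(f"--- {order} ---")
        print(build_prompt(sample, order))
```

Sending both variants to the same model on the same set of dialogues and comparing the returned scores is one simple way to check whether output order matters for a given evaluator.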