A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization

KuanChao Chu, Yi-Pei Chen, Hideki Nakayama

2024-06-14

Abstract: This research investigates prompt designs of evaluating generated texts using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for open-ended text evaluation remains challenging due to model sensitivity and subjectivity in evaluation of text generation. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a different level of rule understanding in the prompt. An additional optimization may enhance scoring alignment if sufficient data is available. This insight is crucial for improving the accuracy and consistency of LLM-based evaluations.

Computation and Language

What problem does this paper attempt to address?

This paper addresses how to design effective prompts when large language models (LLMs) are used to evaluate generated text. Although LLMs are increasingly used to score various inputs, writing effective prompts for open-ended text evaluation remains challenging, mainly because the models are highly sensitive to small changes in the prompt and because judging generated text is inherently subjective. The study focuses on two factors: the order of the output instructions, and whether the model gives an explanatory reason before or after its score. The authors experiment with prompt structures that vary these factors in a dialogue evaluation setting. They find that the order in which reasons and scores are presented significantly affects the LLM's scoring, so prompt design matters for the accuracy of LLM-based evaluation. They also report that, if sufficient data is available, additional prompt optimization may improve scoring alignment. Overall, the paper aims to make LLM-based evaluation of generated text more accurate and consistent.
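To make the idea of "output order" concrete, the sketch below builds two variants of the same evaluation prompt that differ only in whether the score or the reason is requested first, and parses a JSON reply independently of key order. The task wording, the 1 to 10 scale, the JSON format, and the names `build_prompt` and `parse_reply` are all illustrative assumptions; they are not the prompts or code used in the paper, and no model is called.

```python
import json

# Hypothetical task wording and scale; the paper's actual prompts are not reproduced here.
TASK = (
    "Rate the following dialogue response for overall quality "
    "on a scale from 1 (poor) to 10 (excellent)."
)

# The two output orders being compared (assumed labels).
OUTPUT_ORDERS = {
    "score_first": ["score", "reason"],
    "reason_first": ["reason", "score"],
}

FIELD_HINTS = {
    "score": "an integer from 1 to 10",
    "reason": "a short explanation of the score",
}


def build_prompt(text: str, order: str) -> str:
    """Build a prompt whose output instruction lists the fields in the given order."""
    fields = OUTPUT_ORDERS[order]
    schema = ", ".join(f'"{name}": <{FIELD_HINTS[name]}>' for name in fields)
    return (
        f"{TASK}\n\n"
        f"Text to evaluate:\n{text}\n\n"
        f"Answer with a JSON object whose keys appear in this order: {{{schema}}}"
    )


def parse_reply(reply: str) -> tuple[int, str]:
    """Extract (score, reason) from a JSON reply, regardless of key order."""
    data = json.loads(reply)
    return int(data["score"]), str(data["reason"])


if __name__ == "__main__":
    sample = "A: How was your trip?\nB: It was great, thanks for asking!"
    for order in OUTPUT_ORDERS:
        print(f"--- {order} ---")
        print(build_prompt(sample, order))
```

Sending both variants to the same model over the same texts and comparing the resulting scores against human ratings would be one way to observe the order effect the summary describes.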

