Abstract: With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing the text generation quality in a wide range of tasks. However, there still remains a reliability gap between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of reference pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via the response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, then treats the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm RevisEval's effectiveness in bias reduction, the impact of inference cost, and reference relevance.
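The two-step flow described in the abstract (revise the response, then evaluate the response against its own revision) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, the prompt wording, the 1-5 score scale, and the stub models `fake_reviser` and `fake_judge` are all hypothetical and stand in for real model calls.

```python
import re
from typing import Callable, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def revise_response(llm: LLM, instruction: str, response: str) -> str:
    """Ask a reviser LLM to improve the response; the revision becomes the reference."""
    prompt = (
        "Revise the response so that it better answers the instruction, "
        "changing only what is necessary.\n"
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        "Revised response:"
    )
    return llm(prompt).strip()


def judge_with_reference(llm: LLM, instruction: str, response: str, reference: str) -> int:
    """Ask a judge LLM for a score, showing it the response-adapted reference."""
    prompt = (
        "Rate the response from 1 to 5, using the reference as guidance.\n"
        f"Instruction: {instruction}\n"
        f"Reference: {reference}\n"
        f"Response: {response}\n"
        "Score:"
    )
    raw = llm(prompt)
    match = re.search(r"\d+", raw)  # take the first integer in the judge's reply
    if match is None:
        raise ValueError(f"no score found in judge output: {raw!r}")
    return int(match.group())


def revis_eval(reviser: LLM, judge: LLM, instruction: str, response: str) -> Tuple[str, int]:
    """Step 1: build a response-adapted reference. Step 2: judge against it."""
    reference = revise_response(reviser, instruction, response)
    score = judge_with_reference(judge, instruction, response, reference)
    return reference, score


# Stub models so the sketch runs without any API access (assumptions, not real LLMs).
def fake_reviser(prompt: str) -> str:
    return "Paris is the capital of France."


def fake_judge(prompt: str) -> str:
    return "Score: 4"


reference, score = revis_eval(
    fake_reviser, fake_judge, "What is the capital of France?", "paris is capital"
)
print(reference, score)
```

In practice the two stubs would be replaced by calls to whichever models act as reviser and judge; keeping them as plain callables leaves that choice open.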

What problem does this paper attempt to address?

The main problem this paper addresses is the reliability and accuracy of large language models (LLMs) as evaluators (i.e., LLM-as-a-Judge) when assessing text generation quality. Although LLM-as-a-Judge has become a cost-effective alternative to human evaluation, a reliability gap remains between the two, especially on open-ended instruction-following tasks. An important reason for this gap is the lack of effective references (oracles) during the evaluation process. To address this, the authors propose RevisEval, a new paradigm that improves text generation evaluation through response-adapted references.

### Specific Problems and Solutions:

1. **Reliability Gap**:
   - **Problem**: When evaluating text generation quality, LLM-as-a-Judge is less reliable than human evaluation because it lacks effective references.
   - **Solution**: RevisEval uses the text revision ability of large language models to adaptively revise the generated response and uses the revised text as the reference (the response-adapted reference), thereby improving the accuracy and reliability of the evaluation.
2. **Challenges of References**:
   - **Problem**: Traditional references may introduce noise, especially in many-to-one problems, that is, when a given task input admits multiple diverse and valid responses.
   - **Solution**: The response-adapted references generated by RevisEval are both high quality and highly relevant to the response being evaluated, which reduces noise and bias.
3. **Limitations of Evaluation Methods**:
   - **Problem**: Existing evaluation methods each have limitations. Reference-free evaluation may miss subtle differences in the text, while reference-based evaluation may be constrained by the specific references it is given.
   - **Solution**: RevisEval combines the advantages of both. By generating response-adapted references, it keeps the benefits of having a reference while avoiding the limitations of a fixed one.

### Main Contributions:

1. **Proposing the RevisEval Paradigm**: Generating response-adapted references improves the evaluation performance of LLM-as-a-Judge.
2. **Verifying Effectiveness**: Extensive experiments show that RevisEval outperforms traditional reference-free and reference-based evaluation methods on various natural language generation tasks and open-ended instruction-following tasks.
3. **Improving Classic Evaluation Metrics**: The response-adapted references generated by RevisEval can significantly improve the performance of classic evaluation metrics (such as BLEU and BERTScore), and in some cases can even be comparable to the reference-free evaluation of LLM-as-a-Judge.

In summary, this paper aims to address the reliability and accuracy problems of existing LLM-as-a-Judge approaches to text generation quality evaluation through the RevisEval paradigm, thereby providing a more effective and reliable evaluation method.
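To make the point about classic metrics concrete, the toy sketch below scores one response against two references with a simple unigram-overlap F1. This metric is a deliberately simplified stand-in, not BLEU or BERTScore, and the strings are invented for illustration only; they are not data or results from the paper. The idea shown is that a fixed reference worded differently from a valid response can yield no overlap at all, whereas a reference obtained by revising the response itself stays lexically close to it.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example strings (assumptions for illustration only).
response = "the cat sat on the mat"
fixed_reference = "a feline rested upon a rug"   # valid answer, different wording
adapted_reference = "the cat sat on a mat"       # hypothetical revision of the response

print("fixed reference:  ", unigram_f1(response, fixed_reference))
print("adapted reference:", unigram_f1(response, adapted_reference))
```

The same comparison could be run with an established BLEU or BERTScore implementation in place of `unigram_f1`; only the reference passed to the metric changes.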

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

RepEval: Effective Text Evaluation with LLM Representation

Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

LLM-Ref: Enhancing Reference Handling in Technical Writing with Large Language Models

Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation

Intrinsic Task-based Evaluation for Referring Expression Generation

Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models

JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback

Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing

From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications

Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization

REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation

An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

Exploring Precision and Recall to assess the quality and diversity of LLMs