RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Qiyuan Zhang, Yufei Wang, Tiezheng YU, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
2024-10-08
Abstract: With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing text generation quality in a wide range of tasks. However, there still remains a reliability gap between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of references pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, then treats the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm RevisEval's effectiveness in bias reduction, the impact of inference cost, and reference relevance.
Computation and Language
What problem does this paper attempt to address?
The main problem this paper addresses is improving the reliability and accuracy of large language models (LLMs) used as evaluators (LLM-as-a-Judge) for assessing text generation quality. Although LLM-as-a-Judge has become a cost-effective alternative to human evaluation, a reliability gap remains between the two, especially on open-ended instruction-following tasks. An important reason for this gap is the lack of effective reference standards (oracles) during the evaluation process. To address it, the authors propose RevisEval, a new paradigm that improves text generation evaluation through response-adapted references.

### Specific Problems and Solutions:

1. **Reliability Gap**:
   - **Problem**: When evaluating text generation quality, LLM-as-a-Judge is less reliable than human evaluation because it lacks effective reference standards.
   - **Solution**: RevisEval uses the text revision ability of large language models to adaptively revise the generated response and uses the revised text as the reference (a response-adapted reference), thereby improving the accuracy and reliability of the evaluation.
2. **Challenges of Reference Standards**:
   - **Problem**: Traditional references may introduce noise, especially in many-to-one problems, that is, when a given task input admits multiple diverse and valid responses.
   - **Solution**: The response-adapted references generated by RevisEval are both high quality and highly relevant to the response being evaluated, which reduces noise and bias.
3. **Limitations of Evaluation Methods**:
   - **Problem**: Existing evaluation methods, such as reference-free evaluation and reference-based evaluation, each have limitations. Reference-free evaluation may fail to capture subtle differences in the text, while reference-based evaluation may be constrained by the specific references it is given.
   - **Solution**: RevisEval combines the advantages of both. By generating response-adapted references, it retains the benefits of having a reference while avoiding the limitations of a fixed one.

### Main Contributions:

1. **Proposing the RevisEval Paradigm**: Generating response-adapted references improves the evaluation performance of LLM-as-a-Judge.
2. **Verifying Effectiveness**: Extensive experiments show that RevisEval outperforms traditional reference-free and reference-based evaluation methods across natural language generation tasks and open-ended instruction-following tasks.
3. **Improving Classic Evaluation Metrics**: The response-adapted references generated by RevisEval can significantly improve the performance of classic evaluation metrics (such as BLEU and BERTScore), and in some cases these metrics become comparable to the reference-free evaluation of LLM-as-a-Judge.

In summary, this paper aims to address the reliability and accuracy problems of existing LLM-as-a-Judge approaches to text generation quality evaluation through the RevisEval paradigm, thereby providing a more effective and reliable evaluation method.
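
The revise-then-evaluate flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `revis_eval_score`, `token_f1`, and `toy_reviser` are hypothetical names introduced here. In the paper the reviser is an LLM and the downstream scorer is an LLM judge or a metric such as BLEU or BERTScore; this sketch assumes a hard-coded reviser and a simple unigram-overlap F1 as stand-ins so that it runs without any external service.

```python
from collections import Counter
from typing import Callable


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1; an assumed stand-in for a classic metric such as BLEU."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def revis_eval_score(
    instruction: str,
    response: str,
    revise: Callable[[str, str], str],
    metric: Callable[[str, str], float] = token_f1,
) -> float:
    """Revise the response, then score the response against its own revision."""
    # Step 1: build the response-adapted reference (an LLM call in the paper;
    # any callable with this signature works in this sketch).
    reference = revise(instruction, response)
    # Step 2: evaluate the original response against that reference.
    return metric(response, reference)


def toy_reviser(instruction: str, response: str) -> str:
    """Hypothetical reviser that fixes one known factual error."""
    return response.replace("Lyon", "Paris")


question = "What is the capital of France?"
good = revis_eval_score(question, "The capital of France is Paris", toy_reviser)
bad = revis_eval_score(question, "The capital of France is Lyon", toy_reviser)
print(f"good={good:.3f} bad={bad:.3f}")
```

Because the reference is derived from the response itself, a correct response needs no revision and scores highest, while a flawed response is penalized only for the parts the reviser had to change. Swapping `toy_reviser` for a real LLM call, or `token_f1` for another metric, leaves `revis_eval_score` unchanged.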