Abstract: The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics like BLEU and ROUGE, while useful, are increasingly inadequate for capturing the subtle semantics and contextual richness of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges. Through experiments on three open-ended question-answering tasks, we demonstrate that combining multiple LLMs-as-judges significantly improves the reliability and accuracy of evaluations, particularly in complex tasks where a single model might struggle. Our findings reveal a strong correlation with human evaluations, establishing our method as a viable and effective alternative to traditional metrics and human judgments, particularly in the context of LLM-based chat assistants where the complexity and diversity of responses challenge existing benchmarks.

What problem does this paper attempt to address?

This paper addresses a shortcoming in the automatic evaluation of free-form text generation: traditional metrics such as BLEU and ROUGE cannot fully capture the semantic nuances and contextual richness of generated text. These metrics mainly measure surface-form similarity, so they miss expressions that are semantically equivalent but lexically or structurally different. They also perform poorly on open-ended generation and free-form text, where multiple answers can be acceptable. The limitation is particularly evident when evaluating instruction-tuned chat models, which tend to produce longer and more diverse responses.

To overcome these problems, the paper proposes a reference-guided verdict method, which automates evaluation by using multiple large language models (LLMs) as judges. The method aims to improve the reliability and accuracy of evaluation, especially in complex tasks where a single model may be insufficient. Experimental results show that combining multiple LLMs as judges significantly improves the reliability and accuracy of evaluation and correlates strongly with human evaluation results, thus providing a viable alternative to traditional metrics and human judgment.

The main contributions of the paper include:

- Proposing a reference-guided verdict method for context-aware automated evaluation of free-form output.
- Demonstrating that combining multiple LLMs as judges can enhance the reliability and accuracy of evaluation, especially in complex tasks.
- Showing that when LLMs are instructed to explain their decisions, they can provide consistent evaluations and show higher sensitivity to open and detailed prompts, emphasizing the importance of prompt design in automated evaluation.
- Verifying the effectiveness of the proposed method by comparing it with human evaluation results, showing a strong correlation, and establishing the method as a viable alternative to human judgment.
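
To make the idea of combining several judges concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation. The summary does not say how the judges' verdicts are combined or how agreement with humans is measured, so the majority vote, the conservative tie rule, the simple agreement rate, and every name and data value below (`majority_verdict`, `agreement_rate`, the example verdicts) are hypothetical. In practice each verdict would come from an LLM that is shown the question, the reference answer, and the candidate answer.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Combine per-judge verdicts by majority vote.

    Assumption: ties resolve to "incorrect" as a conservative default;
    the source does not specify an aggregation rule.
    """
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "incorrect"
    return ranked[0][0]


def agreement_rate(predicted, human):
    """Fraction of items where the combined verdict matches the human label."""
    if not predicted or len(predicted) != len(human):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(predicted)


# Hypothetical verdicts from three judges on four candidate answers.
judge_verdicts = [
    ["correct", "correct", "incorrect"],
    ["incorrect", "incorrect", "incorrect"],
    ["correct", "incorrect", "correct"],
    ["incorrect", "correct", "incorrect"],
]
# Hypothetical human labels for the same four answers.
human_labels = ["correct", "incorrect", "correct", "correct"]

combined = [majority_verdict(v) for v in judge_verdicts]
agreement = agreement_rate(combined, human_labels)
print(combined, agreement)
```

An odd number of judges avoids ties for binary verdicts. A chance-corrected statistic such as Cohen's kappa could replace the raw agreement rate when the labels are imbalanced.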

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text