Abstract: Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and about what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges, focusing on a clean scenario in which inter-human agreement is high. Investigating thirteen judge models of different model sizes and families, judging answers of nine different 'exam-taker models' - both base and instruction-tuned - we find that only the best (and largest) models achieve reasonable alignment with humans. However, they are still quite far behind inter-human agreement, and their assigned scores may still differ by up to 5 points from human-assigned scores. When it comes to ranking the nine exam-taker models, by contrast, smaller models and even the lexical metric 'contains' may also provide a reasonable signal. Through error analysis and other studies, we identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency. The fact that even the best judges differ from humans in this comparatively simple setup suggests that caution may be wise when using judges in more complex setups. Lastly, our research rediscovers the importance of using alignment metrics beyond simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores.
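
The closing point of the abstract, that high percent agreement can coexist with very different scores, can be illustrated with a small sketch. The data below are invented toy labels, not results from the paper. Cohen's kappa is used only as one familiar chance-corrected metric, since the abstract does not name the metric the authors use. The `contains` function is an assumed reading of the lexical metric (a case-insensitive substring match against the reference answer).

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both raters labelled independently
    # according to their own label frequencies.
    labels = set(a) | set(b)
    p_expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters always use the same single label
    return (p_observed - p_expected) / (1 - p_expected)


def contains(reference, answer):
    """Assumed lexical metric: reference appears verbatim in the answer."""
    return reference.strip().lower() in answer.lower()


# Toy data (hypothetical): humans mark 90 of 100 answers correct,
# while a maximally lenient judge marks every answer correct.
human = [1] * 90 + [0] * 10
judge = [1] * 100

agreement = percent_agreement(human, judge)
kappa = cohen_kappa(human, judge)
human_score = 100 * sum(human) / len(human)
judge_score = 100 * sum(judge) / len(judge)

print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
print(f"score gap:         {judge_score - human_score:.0f} points")
```

In this toy case the judge agrees with the humans on 90% of items, yet the chance-corrected agreement is zero and the judge's score for the exam-taker is 10 points higher than the human one, which is the kind of gap that percent agreement alone hides.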

LLM-AS-AN-INTERVIEWER: Beyond Static Testing Through Dynamic LLM Evaluation

LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation

Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach

NewsInterview: a Dataset and a Playground to Evaluate LLMs' Ground Gap via Informational Interviews

EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria

A Survey of Useful LLM Evaluation

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models

A Survey on LLM-as-a-Judge

LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites

State of What Art? A Call for Multi-Prompt LLM Evaluation

Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences

Large Language Models as Partners in Student Essay Evaluation

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks