Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

2023-12-24

Abstract: Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
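To make the "over 80% agreement" figure concrete, here is a minimal sketch of one way an agreement rate between a judge and a human annotator could be computed: the fraction of pairwise comparisons on which both picked the same outcome. The function name `agreement_rate`, the vote labels, and the sample votes are all hypothetical, and the paper's exact agreement definition may differ from this simplified version.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of comparisons where both voters chose the same outcome.

    Simplified illustration only; not necessarily the metric used in the paper.
    """
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Hypothetical verdicts for five pairwise comparisons ("A", "B", or "tie").
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge, human)
print(f"agreement: {rate:.0%}")  # 4 of the 5 verdicts match
```

Whether ties count as a third outcome or are excluded changes the resulting number, so any reported rate should state which convention it uses.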

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses how to evaluate large language models (LLMs) on their alignment with human preferences, particularly in multi-turn conversation and instruction following. Existing benchmarks focus mainly on closed-ended questions with short answers, which do not fully capture performance on open-ended tasks, especially the ability to follow instructions across multiple dialogue turns. The authors therefore introduce two benchmarks, MT-bench and Chatbot Arena, and explore using strong LLMs as "judges" to automate the evaluation process. The main contributions of the paper are:

1. **Systematic study of LLMs as judges**: Investigating the advantages and limitations of LLMs as judges, including position bias, verbosity bias, self-enhancement bias, and limited mathematical reasoning ability.
2. **Building high-quality human preference datasets**: Collecting high-quality questions and user interaction data through MT-bench and Chatbot Arena to evaluate the performance of LLMs.
3. **Proposing a hybrid evaluation framework**: Combining existing capability-based benchmarks with new preference-based benchmarks to assess both the core capabilities of LLMs and their alignment with human preferences more comprehensively.

Together, these contributions aim to provide a scalable, automated evaluation approach that better reflects human preferences.
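Position bias, the first limitation listed above, can be illustrated with a short sketch. One common mitigation, assumed here for illustration rather than taken from the summary above, is to query the judge twice with the two answers swapped and accept a winner only when both verdicts agree. The `judge` callable, its verdict labels, and the two toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# A judge takes (question, answer shown first, answer shown second)
# and returns "A" (first wins), "B" (second wins), or "tie".
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, question: str, answer_1: str, answer_2: str) -> str:
    """Return "answer_1", "answer_2", or "tie" after a swapped-order check."""
    first = judge(question, answer_1, answer_2)   # answer_1 in position A
    second = judge(question, answer_2, answer_1)  # answer_1 in position B
    # Translate the second verdict back into the first call's labelling.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    if first == second_unswapped:
        return {"A": "answer_1", "B": "answer_2", "tie": "tie"}[first]
    # The verdict flipped with the ordering, so treat the pair as a tie.
    return "tie"


def first_position_judge(question: str, a: str, b: str) -> str:
    """Toy judge with extreme position bias: always prefers the first answer."""
    return "A"


def longer_answer_judge(question: str, a: str, b: str) -> str:
    """Toy judge that is order-independent (it prefers the longer answer)."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


biased = judge_both_orders(first_position_judge, "q", "short", "a much longer answer")
stable = judge_both_orders(longer_answer_judge, "q", "short", "a much longer answer")
print(biased, stable)
```

The swapped-order check doubles the number of judge calls, and the length-preferring toy judge passes it even though it exhibits verbosity bias, so this check targets position bias only.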

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

JudgeBench: A Benchmark for Evaluating LLM-based Judges

A Survey on LLM-as-a-Judge

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers

Can LLM be a Personalized Judge?

Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4

Human-Centered Design Recommendations for LLM-as-a-Judge

AgentBench: Evaluating LLMs as Agents

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents