Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
2023-12-24
Abstract: Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to evaluate LLM-based chat assistants on their alignment with human preferences, particularly in open-ended, multi-turn conversations that require instruction following. Existing benchmarks focus mainly on closed-ended questions with short answers, so they do not fully capture how models perform on open-ended tasks or how well they follow instructions across multiple turns. The authors therefore introduce two benchmarks, MT-bench and Chatbot Arena, and explore using strong LLMs as "judges" to automate the evaluation process. The main contributions of the paper are:

1. **Systematic study of LLMs as judges**: Investigating the advantages and limitations of LLMs as judges, including position bias, verbosity bias, self-enhancement bias, and limited mathematical reasoning ability.
2. **Building high-quality human preference datasets**: Collecting high-quality questions and user interaction data through MT-bench and Chatbot Arena to evaluate the performance of LLMs.
3. **Proposing a hybrid evaluation framework**: Suggesting that existing capability-based benchmarks be combined with new preference-based benchmarks to assess both the core capabilities of LLMs and their alignment with human preferences.

Together, these contributions aim to provide a scalable, automated evaluation approach that better reflects human preferences for LLMs.
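
To make the ideas of judge-human agreement and position bias concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the vote labels (`"model_a"`, `"model_b"`, `"tie"`), and the toy judges are all hypothetical, and the swap-and-compare rule shown is just one simple way to reduce position bias. A real judge would be a call to an LLM rather than a local function.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> "first", "second", or "tie", relative to the order shown to the judge.
Judge = Callable[[str, str, str], str]


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of comparisons where judge and human chose the same outcome."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("at least one vote is required")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


def position_consistent_verdict(
    judge: Judge, question: str, answer_a: str, answer_b: str
) -> str:
    """Ask the judge in both orders; keep the verdict only if it is stable.

    Assumption for illustration: an unstable verdict is recorded as a tie.
    """
    original = {"first": "model_a", "second": "model_b", "tie": "tie"}
    swapped = {"first": "model_b", "second": "model_a", "tie": "tie"}
    verdict_1 = original[judge(question, answer_a, answer_b)]
    verdict_2 = swapped[judge(question, answer_b, answer_a)]
    return verdict_1 if verdict_1 == verdict_2 else "tie"


# Toy judges standing in for an LLM call (hypothetical).
def always_first(question: str, first: str, second: str) -> str:
    """A fully position-biased judge: always prefers whatever is shown first."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """A verbosity-biased judge: prefers the longer answer, whatever its position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


if __name__ == "__main__":
    judge_votes = ["model_a", "model_b", "tie", "model_a"]
    human_votes = ["model_a", "model_b", "model_a", "model_a"]
    print(agreement_rate(judge_votes, human_votes))
    print(position_consistent_verdict(always_first, "q", "x", "y"))
    print(position_consistent_verdict(longer_wins, "q", "short", "a much longer answer"))
```

In this sketch the position-biased judge flips its preference when the answers are swapped, so its verdict collapses to a tie, while the length-based judge gives the same verdict in both orders. The second case also shows that order swapping only addresses position bias and does nothing about verbosity bias.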