Agent-as-a-Judge: Evaluate Agents with Agents

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
2024-10-17
Abstract: Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems -- by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the inadequacy of current evaluation methods for agentic systems. Existing methods either judge only the final outcome, ignoring the step-by-step process by which an agentic system solves a problem, or require a large amount of manual labor. To overcome this, the paper introduces the "Agent-as-a-Judge" framework, in which agentic systems evaluate other agentic systems. The framework retains the cost-effectiveness of LLM-as-a-Judge while adding agentic capabilities, so it can give rich intermediate feedback on the entire task-solving process rather than on the final result alone.

The main contributions of the paper are:
1. The release of the DevAI dataset, which contains 55 comprehensive AI development tasks, each with detailed user requirements and preference labels.
2. A benchmark of three leading open-source code generation agentic systems under three evaluation methods: human judgment, LLM-as-a-Judge, and Agent-as-a-Judge.
3. The concept of Agent-as-a-Judge, which lets agentic systems conduct fair and rich evaluations without incurring the cost of manual evaluation.
4. Evidence that Agent-as-a-Judge outperforms LLM-as-a-Judge in evaluating code generation systems and agrees closely with human judges.

Through these contributions, the paper provides evaluation tools and methods for developing modern agentic systems, supporting their self-improvement and practical application.
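One simple way to make "agrees closely with human judges" concrete is to compare per-requirement verdicts from an automated judge against human labels. The sketch below is only an illustration under stated assumptions: the function name `agreement_rate`, the requirement IDs, and all verdicts are hypothetical and are not taken from the paper or from DevAI, and the paper's own metric may be defined differently.

```python
from typing import Dict


def agreement_rate(judge: Dict[str, bool], human: Dict[str, bool]) -> float:
    """Fraction of shared requirements on which the two sets of verdicts match.

    Each dict maps a requirement ID to True (satisfied) or False (not satisfied).
    This is a hypothetical helper, not an API from the paper.
    """
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no requirements in common")
    matches = sum(judge[r] == human[r] for r in shared)
    return matches / len(shared)


# Made-up verdicts for four made-up requirements (illustration only).
human = {"R0": True, "R1": False, "R2": True, "R3": False}
judge_a = {"R0": True, "R1": True, "R2": True, "R3": True}
judge_b = {"R0": True, "R1": False, "R2": True, "R3": True}

print(agreement_rate(judge_a, human))  # 0.5
print(agreement_rate(judge_b, human))  # 0.75
```

Scoring at the level of individual requirements, rather than one pass/fail per task, is what allows a judge to report on intermediate steps instead of only the final outcome.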