Agent-as-a-Judge: Evaluate Agents with Agents

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

2024-10-17

Abstract: Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems -- by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement.

Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the inadequacy of current evaluation methods for agentic systems. Existing methods either focus solely on the final outcome, ignoring the step-by-step process by which an agentic system solves a problem, or require a significant amount of manual labor. To overcome these issues, the paper introduces the "Agent-as-a-Judge" framework, which uses agentic systems to evaluate other agentic systems. The framework retains the cost-effectiveness of LLM-as-a-Judge while adding agentic capabilities that provide rich intermediate feedback, so the entire task-solving process can be evaluated rather than only the end result.

The main contributions of the paper are:

1. The release of the DevAI dataset, which contains 55 comprehensive AI development tasks, each with detailed user requirements and preference labels.
2. A benchmark of three leading open-source code generation agentic systems under three evaluation methods: human judgment, LLM-as-a-Judge, and Agent-as-a-Judge.
3. The concept of Agent-as-a-Judge itself, which allows agentic systems to conduct fair and rich evaluations without incurring the cost of traditional manual evaluation.
4. A demonstration that Agent-as-a-Judge outperforms LLM-as-a-Judge in evaluating code generation systems and shows a high degree of consensus with human judges.

Through these contributions, the paper provides evaluation tools and methods for the development of modern agentic systems, which may help promote their self-improvement and practical application.
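To make the idea of requirement-level judging and "consensus with human judges" concrete, here is a minimal Python sketch. It is not the paper's implementation. The `Requirement` schema, the rule that a requirement only counts as met when its prerequisites are also met, and the simple agreement rate are all assumptions chosen for illustration; the summary above only states that requirements are hierarchical and that the judge's verdicts are compared with human judgment.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One user requirement of a task (hypothetical schema, not DevAI's format)."""
    req_id: str
    description: str
    prerequisites: tuple[str, ...] = ()


def effective_verdicts(requirements, raw):
    """Assumed rule: a requirement is met only if it and all its prerequisites are met."""
    by_id = {r.req_id: r for r in requirements}

    def met(req_id, seen=()):
        if req_id in seen:  # guard against cyclic prerequisite chains
            return False
        req = by_id[req_id]
        return raw.get(req_id, False) and all(
            met(p, seen + (req_id,)) for p in req.prerequisites
        )

    return {r.req_id: met(r.req_id) for r in requirements}


def agreement_rate(judge, human):
    """Fraction of shared requirements on which two sets of verdicts agree."""
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no requirements in common")
    return sum(judge[k] == human[k] for k in shared) / len(shared)


# Illustrative task with three chained requirements (made-up data).
requirements = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train the model", prerequisites=("R0",)),
    Requirement("R2", "Save evaluation metrics", prerequisites=("R1",)),
]
judge_raw = {"R0": True, "R1": False, "R2": True}    # per-requirement judge output
human = {"R0": True, "R1": False, "R2": False}       # human reference labels

judge_effective = effective_verdicts(requirements, judge_raw)
print(judge_effective)
print(agreement_rate(judge_effective, human))
```

In this made-up example the judge marks R2 as satisfied although its prerequisite R1 failed. Applying the assumed dependency rule turns that verdict into a failure, which brings the judge into full agreement with the human labels.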


BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

AgentBench: Evaluating LLMs as Agents

AI Agents That Matter

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges

AgentStudio: A Toolkit for Building General Virtual Agents

ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

AgentsCourt: Building Judicial Decision-Making Agents with Court Debate Simulation and Legal Knowledge Augmentation

Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

Engineering AI Judge Systems

Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture

Autonomous Evaluation and Refinement of Digital Agents

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery