Abstract: Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping app) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built $\textbf{AppWorld Engine}$, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created $\textbf{AppWorld Benchmark}$ (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.

What problem does this paper attempt to address?

The paper addresses the difficulty of evaluating autonomous agents on everyday digital tasks, particularly tasks that require interacting with and making decisions across multiple applications (such as notes, messaging, and shopping apps). Existing tool-use benchmarks mostly cover tasks that need only a simple sequence of API calls, so they fail to assess how agents perform on realistic tasks that require generating rich code with complex control flow.

To tackle this issue, the research team developed AppWorld, which consists of two parts:

1. **AppWorld Engine**: A high-quality execution environment that simulates 9 everyday applications (such as notes and payment apps), operable via 457 APIs and populated with the digital activity of approximately 100 fictitious users. Agents can operate these applications via APIs without any real-world consequences or resource consumption.
2. **AppWorld Benchmark**: A suite of 750 natural, diverse, and challenging tasks, each requiring the agent to write rich, interactive code. Each task comes with programmatic, state-based unit tests that check whether the agent achieved the task goal, accept different but equally valid solutions, and flag unexpected changes (collateral damage).

Specifically, the main contributions of the paper are:

- Developing a fully controllable, stable, and reproducible application execution environment, the AppWorld Engine.
- Creating a benchmark of complex tasks, the AppWorld Benchmark, covering various application combinations in everyday scenarios.
- Designing a state-based programmatic evaluation method that verifies the outcome of a task rather than the specific steps taken, while also checking for collateral damage.
- Benchmarking several language models with strong prompt-based methods, showing that even the best model, GPT-4o, solves only about 49% of "normal" tasks and about 30% of "challenge" tasks, which indicates the high difficulty of the benchmark.

In summary, the goal of the paper is to advance interactive coding agents by providing a realistic multi-application simulation environment and a comprehensive benchmark of complex tasks.
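
To make the idea of state-based evaluation with a collateral-damage check more concrete, here is a minimal Python sketch. It is not AppWorld's actual evaluation code: the state layout (nested dictionaries of tables and rows), the function names `diff_state` and `evaluate`, and the example task are all hypothetical and chosen only for illustration. The sketch judges the final state rather than the agent's steps, so any solution path that reaches a valid state passes, while changes outside the allowed tables are reported as collateral damage.

```python
from copy import deepcopy


def diff_state(before, after):
    """Return {(table, row_id): (old_row, new_row)} for every row that changed."""
    changes = {}
    for table in set(before) | set(after):
        old_rows = before.get(table, {})
        new_rows = after.get(table, {})
        for row_id in set(old_rows) | set(new_rows):
            old, new = old_rows.get(row_id), new_rows.get(row_id)
            if old != new:
                changes[(table, row_id)] = (old, new)
    return changes


def evaluate(before, after, goal_checks, allowed_tables):
    """Hypothetical state-based test: goal reached and no collateral damage."""
    changes = diff_state(before, after)
    # The goal is checked on the final state only, not on how it was reached.
    goal_met = all(check(after) for check in goal_checks)
    # Any change outside the tables the task may touch is collateral damage.
    collateral = sorted(key for key in changes if key[0] not in allowed_tables)
    return {
        "passed": goal_met and not collateral,
        "goal_met": goal_met,
        "collateral": collateral,
    }


# Assumed example task: "add milk to the shopping cart".
before = {
    "cart": {},
    "notes": {1: {"title": "Groceries", "body": "milk"}},
}

# One agent adds the item and touches nothing else.
after_ok = deepcopy(before)
after_ok["cart"][1] = {"item": "milk", "qty": 1}

# Another agent also adds the item but deletes the user's note along the way.
after_bad = deepcopy(after_ok)
del after_bad["notes"][1]

goal = [lambda state: any(row["item"] == "milk" for row in state["cart"].values())]

result_ok = evaluate(before, after_ok, goal, allowed_tables={"cart"})
result_bad = evaluate(before, after_bad, goal, allowed_tables={"cart"})
print(result_ok)
print(result_bad)
```

In this sketch the second agent reaches the goal but still fails, because the deleted note shows up as a change outside the allowed `cart` table.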

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games

WebArena: A Realistic Web Environment for Building Autonomous Agents

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

Benchmarking Agentic Workflow Generation

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Towards Evaluating Generalist Agents: An Automated Benchmark in Open World

Benchmarking Mobile Device Control Agents across Diverse Configurations

AgentStudio: A Toolkit for Building General Virtual Agents

OpenHands: An Open Platform for AI Software Developers as Generalist Agents