AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian
2024-07-27
Abstract: Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping app) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built **AppWorld Engine**, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created **AppWorld Benchmark** (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.
Software Engineering, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the difficulty of evaluating autonomous agents on everyday digital tasks, particularly tasks that require complex interaction and decision-making across multiple applications (such as notes, messaging, and shopping apps). Existing tool-use benchmarks cover only tasks that need a simple sequence of API calls, so they do not adequately assess agents on realistic tasks that require generating rich code with complex control flow. To close this gap, the research team developed AppWorld, which consists of two parts:

1. **AppWorld Engine**: A high-quality execution environment (about 60K lines of code) that simulates 9 everyday applications, operable through 457 APIs and populated with the digital activities of approximately 100 fictitious users. Agents can operate these applications via APIs without any real-world consequences or resource consumption.
2. **AppWorld Benchmark**: A suite of 750 natural, diverse, and challenging tasks (about 40K lines of code), each requiring the agent to generate rich, interactive code. It is evaluated programmatically with state-based unit tests, which check whether the agent achieved the task goal, accept different but equally valid ways of completing it, and flag unexpected changes (collateral damage).

Specifically, the main contributions of the paper include:

- Developing a fully controllable, stable, and reproducible application execution environment, AppWorld Engine.
- Creating a benchmark of complex tasks, AppWorld Benchmark, covering various application combinations in everyday scenarios.
- Designing a state-based programmatic evaluation method that verifies the outcome of a task rather than the specific steps the agent took, and that also detects unintended side effects.
- Benchmarking several language models with prompt-based methods, showing that even the best model, GPT-4o, solves only about 49% of "normal" tasks and about 30% of "challenge" tasks, while other models solve at least 16% fewer. This indicates the high difficulty of the benchmark.

In summary, the goal of the paper is to advance interactive coding agents by providing a realistic multi-application simulation environment and a comprehensive benchmark of complex tasks.
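
The idea of state-based evaluation described above can be illustrated with a minimal Python sketch. It is not AppWorld's actual API or schema: the toy state layout (table name to records), the helper names `diff_states` and `evaluate`, and the grocery-note task are all assumptions made up for illustration. The sketch compares the environment state before and after an agent runs, checks the goal on the final state only, and treats any change outside an allowed set of tables as collateral damage.

```python
from copy import deepcopy

# Hypothetical toy state: table name -> {record id -> record}.
# This layout is an assumption for illustration, not AppWorld's schema.
State = dict[str, dict[int, dict]]


def diff_states(before: State, after: State) -> dict[str, set[int]]:
    """Return, per table, the ids of records that were added, removed, or changed."""
    changed: dict[str, set[int]] = {}
    for table in before.keys() | after.keys():
        old, new = before.get(table, {}), after.get(table, {})
        ids = {rid for rid in old.keys() | new.keys() if old.get(rid) != new.get(rid)}
        if ids:
            changed[table] = ids
    return changed


def evaluate(before: State, after: State, goal_checks, allowed_tables: set[str]) -> dict:
    """Check the goal on the final state and report changes outside allowed tables."""
    goal_met = all(check(after) for check in goal_checks)
    touched = diff_states(before, after)
    collateral = {t: ids for t, ids in touched.items() if t not in allowed_tables}
    return {"goal_met": goal_met, "collateral": collateral,
            "passed": goal_met and not collateral}


# Hypothetical task: "make sure milk is on a grocery note".
before = {
    "notes": {1: {"title": "Groceries", "items": ["eggs"]}},
    "messages": {7: {"to": "roommate", "text": "hi"}},
}


def milk_on_some_note(state: State) -> bool:
    return any("milk" in note["items"] for note in state.get("notes", {}).values())


# Solution A edits the existing note.
after_a = deepcopy(before)
after_a["notes"][1]["items"].append("milk")

# Solution B reaches the goal but also deletes an unrelated message.
after_b = deepcopy(after_a)
del after_b["messages"][7]

# Solution C reaches the goal a different way, by creating a new note.
after_c = deepcopy(before)
after_c["notes"][2] = {"title": "Extra", "items": ["milk"]}

result_a = evaluate(before, after_a, [milk_on_some_note], {"notes"})
result_b = evaluate(before, after_b, [milk_on_some_note], {"notes"})
result_c = evaluate(before, after_c, [milk_on_some_note], {"notes"})
print(result_a["passed"], result_b["passed"], result_c["passed"])
```

In this sketch, solutions A and C both pass even though they change different records, because only the final state is checked. Solution B meets the goal but fails because it modified the `messages` table, which the task gave it no reason to touch.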