CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li
2024-10-18
Abstract: The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several key problems in how multimodal language model (MLM) agents are evaluated in interactive environments. Existing benchmarks have the following limitations:

1. **Single-environment limitation**: Existing benchmarks mainly focus on a single platform (such as the Web, Android, or a desktop operating system) and cannot reflect real-world scenarios that span multiple platforms.
2. **Lack of detailed, general evaluation methods**: Existing methods usually check only whether the final goal is achieved (goal-oriented evaluation) or compare the agent's actions with a predefined optimal path (trajectory-oriented evaluation). They ignore the intermediate states reached during task completion and the possibility of multiple valid paths.
3. **Complexity of task and evaluator construction**: Creating tasks and evaluators is complex, time-consuming, and difficult to scale.

To solve these problems, the paper introduces CRAB (Cross-environment Agent Benchmark), a new cross-environment agent benchmark framework. Its main contributions are:

- **Support for cross-environment tasks**: CRAB can evaluate agents on tasks that span multiple devices and platforms, which is closer to real-world usage.
- **Graph-based fine-grained evaluation method**: CRAB proposes a "graph evaluator" that decomposes a task into multiple subtasks and checks the completion of each one, giving more detailed and fairer results than a single pass/fail check.
- **Efficient task and evaluator construction mechanism**: CRAB builds tasks and evaluators by composing subtasks, which makes task creation more flexible and scalable.

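The graph evaluator idea can be sketched in a few lines of Python. This is an illustrative sketch only, not CRAB's actual API: the `Subtask` class, the `evaluate` and `completion_ratio` functions, and the dictionary-based state are all hypothetical names introduced here. It also assumes that a subtask counts as completed only after all of its predecessors are completed, and that the completion ratio is the fraction of completed subtask nodes.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set

# Hypothetical environment snapshot shared by all checkers.
State = Dict[str, object]


@dataclass
class Subtask:
    """One node of the evaluation graph (illustrative, not CRAB's API)."""
    name: str
    check: Callable[[State], bool]  # True once this subtask's goal state holds
    depends_on: List[str] = field(default_factory=list)


def evaluate(subtasks: List[Subtask], state: State) -> Set[str]:
    """Return the names of subtasks counted as completed."""
    by_name = {s.name: s for s in subtasks}
    graph = {s.name: s.depends_on for s in subtasks}
    done: Set[str] = set()
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        node = by_name[name]
        # Assumption: a node only counts if all predecessors are completed.
        if all(dep in done for dep in node.depends_on) and node.check(state):
            done.add(name)
    return done


def completion_ratio(subtasks: List[Subtask], state: State) -> float:
    """Fraction of subtask nodes completed (assumed definition)."""
    return len(evaluate(subtasks, state)) / len(subtasks)


# A made-up cross-environment task: read a message on the phone,
# then create a file on the desktop containing the message text.
tasks = [
    Subtask("read_message", lambda s: s.get("phone_message_opened") is True),
    Subtask("create_file", lambda s: s.get("desktop_file_exists") is True,
            depends_on=["read_message"]),
    Subtask("copy_text",
            lambda s: s.get("desktop_file_text") == s.get("phone_message_text"),
            depends_on=["create_file"]),
]
state = {
    "phone_message_opened": True,
    "phone_message_text": "hello",
    "desktop_file_exists": True,
    "desktop_file_text": "hi",  # wrong content, so the last subtask fails
}

print(sorted(evaluate(tasks, state)))             # ['create_file', 'read_message']
print(round(completion_ratio(tasks, state), 2))   # 0.67
```

In this example a goal-only check would report a plain failure, while the graph-based check credits the two subtasks the agent did complete. Because each subtask carries its own checker, new tasks can be assembled by recombining existing `Subtask` nodes, which mirrors the subtask-composition idea described above.
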
In addition, based on the CRAB framework, the authors developed a cross-platform benchmark, Crab Benchmark-v0, which contains 120 practical tasks covering common applications and tools in desktop and mobile environments, such as calendars, email, maps, browsers, and terminals. The authors evaluated single-agent and multi-agent systems with different structures. The single-agent configuration using GPT-4o achieved the best overall completion ratio (38.01%), which highlights the need for more effective autonomous agents. With these improvements, CRAB provides a more comprehensive and accurate framework for evaluating multimodal language model agents and supports further research in the field.