Abstract: The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.

### What problems does this paper attempt to solve?

This paper addresses several problems in how multimodal language model (MLM) agents are evaluated in interactive environments. Specifically, existing benchmark frameworks have the following limitations:

1. **Single-environment limitation**: Existing benchmarks mainly focus on a single platform (such as the Web, Android, or a desktop operating system) and cannot reflect real-world scenarios that span multiple platforms.
2. **Lack of detailed, general evaluation methods**: Existing evaluation methods usually check only whether the final goal is achieved (goal-oriented evaluation) or compare the agent's actions with a predefined optimal path (trajectory-oriented evaluation). Both ignore the intermediate states reached during a task and the possibility of multiple valid paths.
3. **Complexity of task and evaluator construction**: Creating tasks and evaluators is complex, time-consuming, and difficult to scale.

To solve these problems, the paper introduces CRAB (Cross-environment Agent Benchmark), a new cross-environment agent benchmark framework. Its main contributions are:

- **Support for cross-environment tasks**: CRAB can evaluate agents on tasks that span multiple devices and platforms, which is closer to real-world usage.
- **Graph-based fine-grained evaluation method**: CRAB proposes a "graph evaluator" that decomposes a task into multiple subtasks and checks the completion of each one, giving more detailed and fairer results than a single pass/fail check.
- **Efficient task and evaluator construction mechanism**: CRAB constructs tasks and evaluators by combining subtasks, which makes task creation more flexible and scalable.

In addition, based on the CRAB framework, the authors developed a cross-platform benchmark, Crab Benchmark-v0, which contains 120 practical tasks covering common applications and tools in desktop and mobile environments, such as calendars, e-mail, maps, browsers, and terminals. Across single-agent and multi-agent systems with different structures, the single-agent configuration using GPT-4o achieved the best overall completion ratio (38.01%), which highlights the need for more effective autonomous agents. Together, these contributions give CRAB a more comprehensive and accurate way to evaluate multimodal language model agents.
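The graph-based evaluation idea described above can be illustrated with a minimal sketch. This is not the Crab API: the `GraphEvaluator` class, the dictionary-based environment state, and the three example checks are all hypothetical, and the sketch assumes that the completion ratio is the fraction of subtask nodes whose checks have passed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set

State = Dict[str, Any]  # hypothetical snapshot of all environments


@dataclass
class GraphEvaluator:
    """Toy graph evaluator: nodes are subtask checks, edges are prerequisites."""

    # node name -> predicate over the current environment state
    checks: Dict[str, Callable[[State], bool]]
    # node name -> names of nodes that must be completed first
    deps: Dict[str, List[str]] = field(default_factory=dict)
    done: Set[str] = field(default_factory=set)

    def step(self, state: State) -> None:
        """Mark each node whose prerequisites are done and whose check passes."""
        progressed = True
        while progressed:  # repeat so newly unlocked nodes are checked too
            progressed = False
            for name, check in self.checks.items():
                if name in self.done:
                    continue
                ready = all(d in self.done for d in self.deps.get(name, []))
                if ready and check(state):
                    self.done.add(name)
                    progressed = True

    def completion_ratio(self) -> float:
        """Fraction of nodes completed (assumed definition)."""
        return len(self.done) / len(self.checks) if self.checks else 0.0


# Hypothetical cross-environment task: copy a phone message into a desktop file.
evaluator = GraphEvaluator(
    checks={
        "message_opened": lambda s: s.get("phone_app") == "messages",
        "file_created": lambda s: "notes.txt" in s.get("desktop_files", {}),
        "text_copied": lambda s: s.get("desktop_files", {}).get("notes.txt")
        == s.get("message_text"),
    },
    deps={"text_copied": ["message_opened", "file_created"]},
)

# The evaluator is called after each agent action with the latest state.
evaluator.step({"phone_app": "messages", "message_text": "meet at 5"})
evaluator.step(
    {"phone_app": "home", "message_text": "meet at 5", "desktop_files": {"notes.txt": ""}}
)
print(f"completion ratio: {evaluator.completion_ratio():.2f}")
```

In this run the agent opened the message and created the file but never pasted the text, so two of the three nodes are complete. A goal-only evaluator would report a plain failure, whereas the graph gives partial credit for the intermediate states and does not depend on the order in which the two independent subtasks were finished.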

CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
