From Grounding to Planning: Benchmarking Bottlenecks in Web Agents

Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol

2024-09-03

Abstract: General web-based agents are increasingly essential for interacting with complex web environments, yet their performance in real-world web applications remains poor, yielding extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: Planning and Grounding. Yet, most existing research treats these agents as black boxes, focusing on end-to-end evaluations which hinder meaningful improvements. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a new benchmark for each of the components separately, identifying the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and demonstrate practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.

Artificial Intelligence, Multiagent Systems

What problem does this paper attempt to address?

This paper addresses the poor performance of current general-purpose web agents in practical applications: their accuracy on web-based tasks remains extremely low even when they are built on state-of-the-art frontier models. The paper observes that these agents can be decomposed into two main components, Planning and Grounding. Most existing research, however, treats agents as black-box systems and evaluates them only end to end, which makes it hard to tell where failures originate and therefore hinders meaningful improvement. To locate the key bottlenecks, the paper separates the two components and, through refined experiments on the Mind2Web dataset, proposes a new benchmark for each one. The results show that, contrary to the common assumption, grounding is not a significant bottleneck and can be handled effectively with current techniques. The planning component is instead the main source of performance degradation. Based on this analysis, the paper offers new insights and practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.
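
To make the idea of component-wise evaluation concrete, here is a minimal sketch of how per-step results might be scored separately for planning and grounding and compared with an end-to-end score. Everything in it is an assumption made for illustration: the `StepResult` fields, the `component_report` helper, and the toy data are hypothetical and are not taken from the paper or from Mind2Web.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StepResult:
    """Hypothetical per-step record; field names are illustrative only."""
    plan_correct: bool             # planner proposed the right next action
    grounded_with_gold_plan: bool  # grounder found the right element when handed the gold plan
    end_to_end_correct: bool       # full pipeline got the step right


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def component_report(steps: list[StepResult]) -> dict[str, float]:
    """Score each component in isolation alongside the end-to-end score."""
    return {
        "planning": accuracy(s.plan_correct for s in steps),
        "grounding": accuracy(s.grounded_with_gold_plan for s in steps),
        "end_to_end": accuracy(s.end_to_end_correct for s in steps),
    }


# Toy data, made up for this sketch (not results from the paper).
steps = [
    StepResult(plan_correct=True, grounded_with_gold_plan=True, end_to_end_correct=True),
    StepResult(plan_correct=False, grounded_with_gold_plan=True, end_to_end_correct=False),
    StepResult(plan_correct=True, grounded_with_gold_plan=True, end_to_end_correct=True),
    StepResult(plan_correct=False, grounded_with_gold_plan=True, end_to_end_correct=False),
]
print(component_report(steps))
```

In this toy data the grounding score stays high while the planning score tracks the end-to-end score, which is the kind of pattern a component-wise benchmark is meant to expose and a single end-to-end number would hide.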

WebCanvas: Benchmarking Web Agents in Online Environments

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Tur[k]ingBench: A Challenge Benchmark for Web Agents

AI Agents That Matter

WebSuite: Systematically Evaluating Why Web Agents Fail

WebArena: A Realistic Web Environment for Building Autonomous Agents

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

Towards a Realistic Long-Term Benchmark for Open-Web Research Agents

Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

AgentStudio: A Toolkit for Building General Virtual Agents

Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems

FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents

Autonomous Evaluation and Refinement of Digital Agents

The BrowserGym Ecosystem for Web Agent Research

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis