Abstract: Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still underperform largely compared to humans. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the innovative use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g., "what would happen if I click this button?") using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction -- VisualWebArena and Mind2Web-live -- demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.

What problem does this paper attempt to address?

This paper addresses how language agents can automate tasks on complex, dynamic websites more effectively and safely. It proposes a new paradigm: using large language models (LLMs) as world models, combined with model-based planning, to avoid the safety risks and irreversible actions that existing methods face when they explore by interacting with live websites.

The method, WebDreamer, simulates the likely outcome of each candidate action at every step and evaluates these imagined outcomes to choose what to do next, which reduces direct interaction with real websites while keeping task execution effective. Its core idea is "dreaming": before taking any real action, the LLM predicts the result of each potential step, expressed as a natural-language description of the state change. The candidate actions are then scored against these simulated results, and the highest-scoring action is executed. The process repeats until the goal is reached or a termination condition is met.

With this approach, the paper explores whether LLMs can serve as effective world models in complex web environments and how they might be optimized for model-based planning. It evaluates WebDreamer on two representative benchmarks with online interaction, VisualWebArena and Mind2Web-live, and reports substantial improvements over reactive baselines, in particular fewer interactions with the actual website at a maintained task completion rate. Although tree-search-based methods perform slightly better in some cases, WebDreamer offers a more flexible and practical solution once safety and real-world feasibility are taken into account.
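
The planning loop summarized above (simulate each candidate action, score the imagined outcome, execute the best one, repeat) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Planner`, `propose`, `simulate`, `score`, `run`, and the toy `SITE` graph are all hypothetical, and the world model and scorer are plain lookup functions standing in for the LLM calls the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

STOP = "stop"

# Hypothetical toy website: page -> {action: page reached by that action}.
SITE: Dict[str, Dict[str, str]] = {
    "home": {"click Shop": "catalog", "click About": "about"},
    "catalog": {"click Cart": "cart", "click Home": "home"},
    "about": {"click Home": "home"},
    "cart": {},
}


def propose(goal: str, page: str) -> List[str]:
    """Candidate actions on the current page (assumed: every visible link)."""
    return list(SITE[page])


def simulate(page: str, action: str) -> str:
    """Stand-in world model: describe the imagined outcome in natural language.
    In the approach described above, an LLM would produce this description."""
    next_page = SITE[page][action]
    links = ", ".join(SITE[next_page]) or "none"
    return f"The {next_page} page opens with links: {links}"


def score(goal: str, outcome: str) -> float:
    """Stand-in evaluator: estimate progress toward the goal from the description."""
    if outcome.startswith(f"The {goal} page"):
        return 1.0  # imagined outcome reaches the goal page
    if goal.lower() in outcome.lower():
        return 0.5  # goal is visible from the imagined page
    return 0.0


@dataclass
class Planner:
    propose: Callable[[str, str], List[str]]
    simulate: Callable[[str, str], str]
    score: Callable[[str, str], float]

    def choose(self, goal: str, page: str) -> str:
        """Score every candidate on its simulated outcome; return the best one."""
        candidates = self.propose(goal, page)
        if not candidates:
            return STOP
        return max(candidates, key=lambda a: self.score(goal, self.simulate(page, a)))


def run(planner: Planner, goal: str, page: str,
        step: Callable[[str, str], str], max_steps: int = 10) -> List[str]:
    """Only the chosen action touches the real environment via `step`."""
    trajectory: List[str] = []
    for _ in range(max_steps):
        action = planner.choose(goal, page)
        if action == STOP:
            break
        trajectory.append(action)
        page = step(page, action)
    return trajectory


planner = Planner(propose, simulate, score)
trajectory = run(planner, "cart", "home", step=lambda p, a: SITE[p][a])
print(trajectory)
```

In this sketch the real environment is touched once per step, through `step`, while all exploration of alternatives happens inside `simulate`. That separation is the property the summary attributes to model-based planning: irreversible actions are never tried just to see what happens.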

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
