Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su
2024-11-11
Abstract: Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still underperform largely compared to humans. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the innovative use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g., "what would happen if I click this button?") using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction -- VisualWebArena and Mind2Web-live -- demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.
Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how language agents can automate tasks on complex, dynamic websites more effectively and more safely. Planning algorithms such as tree search could improve on today's reactive agents, but running them directly on live websites is risky and often impractical, because some actions (for example, confirming a purchase) cannot be undone. To get around this, the paper proposes a new paradigm: use a large language model (LLM) as a world model of the web and plan against that model instead of the real site.
The method is called WebDreamer. Its core idea is "dreaming": before taking any real action, the agent asks an LLM to predict what each candidate action would lead to, with the predicted state change expressed as a natural-language description. The agent then scores each candidate according to its simulated outcome and executes the one with the highest score. This process repeats until the goal is reached or a termination condition is met. Because candidates are explored in imagination rather than on the site, the agent interacts with the real website less while still completing tasks effectively.
Through this method, the paper explores whether an LLM can serve as an effective world model in complex web environments and how LLMs might be optimized for this kind of model-based planning. It evaluates WebDreamer on two representative benchmarks with online interaction, VisualWebArena and Mind2Web-live, and reports substantial improvements over reactive baselines, in particular fewer interactions with the actual website while maintaining the task completion rate.
Tree-search-based methods perform slightly better in some cases, but once the safety and feasibility of running on real websites are taken into account, WebDreamer offers a more flexible and practical solution.
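The simulate, score, and select loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`plan_step`, `run_episode`, `propose`, `simulate`, `score`, `execute`, `is_done`, and the `TOY_SITE` table) is hypothetical, and the two roles that an LLM would play in the paper's method (simulating outcomes and scoring them) are replaced with small deterministic stand-ins so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

State = str
Action = str

def plan_step(
    state: State,
    candidates: List[Action],
    simulate: Callable[[State, Action], str],
    score: Callable[[str], float],
) -> Tuple[Action, float]:
    """Imagine the outcome of each candidate action, score it, return the best."""
    scored = [(score(simulate(state, action)), action) for action in candidates]
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score

def run_episode(
    state: State,
    propose: Callable[[State], List[Action]],
    simulate: Callable[[State, Action], str],
    score: Callable[[str], float],
    execute: Callable[[State, Action], State],
    is_done: Callable[[State], bool],
    max_steps: int = 10,
) -> Tuple[State, List[Action]]:
    """Repeat plan-then-act until the goal is reached or the step budget runs out."""
    taken: List[Action] = []
    for _ in range(max_steps):
        if is_done(state):
            break
        candidates = propose(state)
        if not candidates:
            break
        action, _ = plan_step(state, candidates, simulate, score)
        # Only the selected action touches the (toy) real environment.
        state = execute(state, action)
        taken.append(action)
    return state, taken

# --- Toy stand-ins (assumptions for illustration only) ---
# A tiny "website": page -> {action: next page}.
TOY_SITE: Dict[State, Dict[Action, State]] = {
    "home": {"click 'Search'": "results", "click 'Ad banner'": "ad"},
    "results": {"click 'Add to cart'": "cart", "click 'Back'": "home"},
    "ad": {"click 'Back'": "home"},
    "cart": {},
}

def propose(state: State) -> List[Action]:
    return list(TOY_SITE[state])

def simulate(state: State, action: Action) -> str:
    # Stand-in for the LLM world model: a natural-language outcome description.
    return f"The page would change to the {TOY_SITE[state][action]} page."

def score(description: str) -> float:
    # Stand-in for the LLM evaluator: rate progress toward reaching the cart.
    progress = {"cart": 1.0, "results": 0.5, "home": 0.1, "ad": 0.0}
    for page, value in progress.items():
        if f"the {page} page" in description:
            return value
    return 0.0

def execute(state: State, action: Action) -> State:
    return TOY_SITE[state][action]

def is_done(state: State) -> bool:
    return state == "cart"

final_state, actions = run_episode("home", propose, simulate, score, execute, is_done)
print(final_state, actions)
```

In this toy run the agent picks the search link over the ad banner and then adds the item to the cart, executing two real actions while the rejected candidates are only ever imagined. In a real agent, `simulate` and `score` would be LLM calls and `execute` would drive a browser.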