Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, Jinyoung Yeo
2024-10-17
Abstract: Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the "world model". Motivated by this, our study first starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4o, Claude-3.5-Sonnet, etc.). Then, we present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models predicting next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training and demonstrate our agents' cost- and time-efficiency compared to recent tree-search-based agents.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the poor performance of current web agents based on large language models (LLMs) on long-horizon tasks. In particular, these agents cannot foresee the results of their actions as humans do, and so fail to avoid irreversible mistakes (such as repeatedly purchasing a non-refundable flight ticket). Humans avoid such situations by considering the possible consequences of their actions; this awareness is called a "world model". Existing LLM-based web agents lack such a world model, so they rely on trial and error when making decisions, which is inefficient and can lead to irreversible errors. Specifically, the paper first confirms through preliminary analyses that current LLMs (such as GPT-4o and Claude-3.5-Sonnet) lack an understanding of environment dynamics: they have difficulty predicting how their actions will change the state of the environment. Based on this finding, the authors propose a World-model-augmented (WMA) web agent that simulates the outcomes of its actions to make better decisions. Training LLMs as world models that predict the next observation is challenging because elements repeat across observations and HTML inputs are long. To overcome this, the authors propose a transition-focused observation abstraction, in which the prediction target is a free-form natural language description that highlights only the important state differences between time steps. The experimental results show that world models trained this way improve the agent's policy selection without training the policy model, and that, compared with recent tree-search-based agents, the approach is 6.8 times more cost-efficient and 5.3 times more time-efficient.
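The decision loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `select_action`, the `world_model` and `value_fn` callables, and the toy stand-ins below are all hypothetical names introduced for illustration. In a real agent both callables would presumably be LLM calls; the sketch only assumes that the world model returns a natural language description of the predicted state change and that some scoring function rates that description against the task goal.

```python
from typing import Callable, Sequence


def select_action(
    observation: str,
    candidate_actions: Sequence[str],
    world_model: Callable[[str, str], str],
    value_fn: Callable[[str, str], float],
    goal: str,
) -> str:
    """Return the candidate whose simulated outcome scores highest.

    All names here are illustrative assumptions, not the paper's API.
    """
    if not candidate_actions:
        raise ValueError("need at least one candidate action")
    best_action, best_score = candidate_actions[0], float("-inf")
    for action in candidate_actions:
        # Simulate the outcome instead of executing the action for real.
        predicted_change = world_model(observation, action)
        # Rate how well the predicted state change serves the goal.
        score = value_fn(goal, predicted_change)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


def toy_world_model(observation: str, action: str) -> str:
    """Hypothetical stand-in that describes only the state difference."""
    if action == "click [Buy ticket]" and "ticket already purchased" in observation:
        return "A second non-refundable ticket is charged to the card."
    if action == "click [View itinerary]":
        return "The itinerary page opens and shows the purchased ticket."
    return "No visible change."


def toy_value_fn(goal: str, predicted_change: str) -> float:
    """Hypothetical scorer that penalizes the irreversible mistake."""
    if "second non-refundable" in predicted_change:
        return 0.0
    if "itinerary" in predicted_change:
        return 1.0
    return 0.5


chosen = select_action(
    observation="Flight page; ticket already purchased",
    candidate_actions=["click [Buy ticket]", "click [View itinerary]", "scroll [down]"],
    world_model=toy_world_model,
    value_fn=toy_value_fn,
    goal="Confirm the booked flight",
)
print(chosen)
```

Because every candidate is evaluated on a predicted outcome and never executed, the repeated purchase is rejected before it can happen. The policy that proposes the candidates is left untouched, which matches the summary's point that policy selection improves without training the policy model.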