Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, Jinyoung Yeo

2024-10-17

Abstract: Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the "world model". Motivated by this, our study first starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4o, Claude-3.5-Sonnet, etc.). Then, we present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models predicting next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training and demonstrate our agents' cost- and time-efficiency compared to recent tree-search-based agents.

Computation and Language

What problem does this paper attempt to address?

This paper addresses the poor performance of current web agents based on large language models (LLMs) on long-horizon tasks. Unlike humans, these agents cannot foresee the outcomes of their actions and so fail to avoid irreversible mistakes, such as repeatedly purchasing a non-refundable flight ticket. Humans avoid such situations by considering the possible consequences of their actions, an awareness known as a "world model". Existing LLM-based web agents lack this world model and therefore rely on trial and error when making decisions, which is inefficient and can lead to irreversible errors.

Specifically, the paper first confirms through preliminary analyses that current LLMs (such as GPT-4o and Claude-3.5-Sonnet) lack an understanding of environment dynamics: they have difficulty predicting how their actions will change the state of the environment. Based on this finding, the authors propose a World-model-augmented (WMA) web agent that simulates the outcomes of its actions to make better decisions. Training LLMs as world models that predict the next observation is challenging because elements repeat across observations and HTML inputs are long. To address this, the authors propose a transition-focused observation abstraction, in which the prediction target is a free-form natural-language description that highlights only the important state differences between time steps. The experimental results show that world models trained this way improve the agent's policy selection without training the policy model, and that, compared with recent tree-search agents, the approach improves cost and time efficiency by 6.8 times and 5.3 times, respectively.
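The two ideas summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: in the paper the world model and the value estimate are produced by LLMs, whereas here `toy_world_model` and `toy_value_fn` are hypothetical rule-based stand-ins, and `state_diff` is only a simple set difference standing in for the transition-focused abstraction. All function names, action strings, and scores are assumptions made for illustration.

```python
from typing import Callable, List


def state_diff(prev_obs: List[str], next_obs: List[str]) -> List[str]:
    """Keep only what changed between two observations.

    Each observation is a list of page-element strings. Elements that
    repeat across both observations are dropped, which mirrors the idea
    of describing state differences rather than the full next page.
    """
    prev_set, next_set = set(prev_obs), set(next_obs)
    added = [f"ADDED: {e}" for e in next_obs if e not in prev_set]
    removed = [f"REMOVED: {e}" for e in prev_obs if e not in next_set]
    return added + removed


def select_action(
    observation: str,
    candidates: List[str],
    world_model: Callable[[str, str], str],
    value_fn: Callable[[str], float],
) -> str:
    """Simulate each candidate action and return the best-scoring one.

    No action is executed in the real environment during selection, so
    an irreversible action can be rejected before it is ever taken.
    """
    scored = [(value_fn(world_model(observation, a)), a) for a in candidates]
    return max(scored, key=lambda pair: pair[0])[1]


def toy_world_model(observation: str, action: str) -> str:
    """Hypothetical stand-in for a trained LLM world model."""
    if action == "click [Buy now]":
        return "A non-refundable purchase is confirmed; the card is charged."
    if action == "click [View details]":
        return "A panel with fare rules and refund policy appears."
    return "Nothing on the page changes."


def toy_value_fn(predicted_change: str) -> float:
    """Hypothetical stand-in for an LLM-based value estimate in [0, 1]."""
    text = predicted_change.lower()
    if "non-refundable" in text:
        return 0.0  # irreversible outcome: strongly discouraged
    if "refund policy" in text:
        return 1.0  # gathers information relevant to the goal
    return 0.2  # no visible progress


if __name__ == "__main__":
    page = "Flight search results with a [Buy now] and a [View details] button"
    actions = ["click [Buy now]", "click [View details]", "scroll down"]
    print(select_action(page, actions, toy_world_model, toy_value_fn))
    print(state_diff(["header", "cart: 0 items"], ["header", "cart: 1 item"]))
```

Because candidate actions are scored on simulated outcomes rather than executed ones, this style of selection needs no backtracking in the live environment, which is the property the summary contrasts with tree-search agents.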
