WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong
DOI: https://doi.org/10.48550/arXiv.2411.02337
2024-11-05
Abstract: Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
Computation and Language
What problem does this paper attempt to address?
This paper asks how open-source large language models (LLMs) can be trained into capable web agents. Its answer is WebRL, a self-evolving curriculum reinforcement learning framework that trains the agent in an online environment. The paper targets three key challenges:

1. **Scarcity of training tasks**: Unlike offline datasets, online benchmarks such as WebArena usually provide only a limited test set for evaluation, which leaves little material for training agents in these environments.
2. **Sparsity and cost of feedback signals**: Without task-specific evaluation functions, it is difficult to judge whether an arbitrary web-browsing task succeeded. Tasks in WebArena also have long horizons, requiring about 10 steps on average, so usable signals are sparse during online exploration.
3. **Policy distribution drift in online learning**: Because there is no predefined training set, the agent must learn through online exploration. This inevitably shifts the distribution of the agent's policy and can cause catastrophic forgetting and performance degradation.

To address these challenges, WebRL generates new tasks from unsuccessful attempts through a self-evolving curriculum, and combines a robust outcome-supervised reward model (ORM) with adaptive reinforcement learning strategies to keep performance improving. Experimental results show that WebRL significantly improves the success rate of open-source LLMs on WebArena-Lite (for example, Llama-3.1-8B rises from 4.8% to 42.4%), surpassing proprietary LLM APIs such as GPT-4-Turbo (17.6%) as well as earlier web agents built on open-source LLMs (AutoWebGLM, 18.2%).
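As a rough illustration of the loop described above, the following minimal Python sketch shows how outcome-only rewards and a curriculum built from failures might fit together. It is not the paper's implementation: `run_phase`, `next_tasks`, and the `toy_*` stand-ins are hypothetical names introduced here, the judge is a fixed rule instead of a learned ORM, and the policy update itself is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    task: str
    steps: int
    reward: float  # outcome-only signal: 1.0 on success, 0.0 otherwise


def run_phase(
    tasks: List[str],
    attempt: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], bool],
) -> List[Rollout]:
    """Attempt every task once and score only the final outcome."""
    rollouts = []
    for task in tasks:
        trajectory = attempt(task)
        # A single reward for the whole trajectory: this is the sparse signal.
        reward = 1.0 if judge(task, trajectory) else 0.0
        rollouts.append(Rollout(task, len(trajectory), reward))
    return rollouts


def next_tasks(
    rollouts: List[Rollout],
    mutate: Callable[[str, int], str],
    per_failure: int = 2,
) -> List[str]:
    """Build the next task set from the tasks the agent failed."""
    failed = [r.task for r in rollouts if r.reward == 0.0]
    return [mutate(task, i) for task in failed for i in range(per_failure)]


# Hypothetical stand-ins so the sketch runs without a browser or a model.
def toy_attempt(task: str) -> List[str]:
    return ["action"] * 10  # roughly the average episode length cited above


def toy_judge(task: str, trajectory: List[str]) -> bool:
    return task.startswith("easy")  # placeholder for a learned ORM


def toy_mutate(task: str, i: int) -> str:
    return f"{task} (variant {i})"  # placeholder for LLM-based task generation


if __name__ == "__main__":
    seed_tasks = ["easy: open the home page", "hard: compare two products"]
    phase = run_phase(seed_tasks, toy_attempt, toy_judge)
    for rollout in phase:
        print(rollout)
    print(next_tasks(phase, toy_mutate))
```

In this toy setup only the failed task yields new variants, so the next phase concentrates on what the agent cannot yet do. A real system would replace the stand-ins with a browser environment, a trained reward model, and an LLM task generator.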