AutoWebGLM: A Large Language Model-based Web Navigating Agent

Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
2024-10-12
Abstract: Large language models (LLMs) have fueled many intelligent web agents, but most existing ones perform far from satisfactorily in real-world web navigation tasks due to three factors: (1) the complexity of HTML text data, (2) the versatility of actions on webpages, and (3) task difficulty due to the open-domain nature of the web. In light of these challenges, we develop the open AutoWebGLM based on ChatGLM3-6B. AutoWebGLM can serve as a powerful automated web navigation agent that outperforms GPT-4. Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages succinctly while preserving vital information. We then employ a hybrid human-AI method to build web browsing data for curriculum training. Finally, we bootstrap the model by reinforcement learning and rejection sampling to further facilitate webpage comprehension, browser operations, and efficient task decomposition by the model itself. For comprehensive evaluation, we establish a bilingual benchmark, AutoWebBench, for real-world web navigation tasks. We evaluate AutoWebGLM across diverse web navigation benchmarks, demonstrating its potential to tackle challenging tasks in real environments. Related code, model, and data are released at https://github.com/THUDM/AutoWebGLM.
Computation and Language
What problem does this paper attempt to address?
The paper attempts to address the issue of poor performance of existing large language model (LLM)-based automatic web agents in real-world web navigation tasks. Specifically, these agents exhibit significant shortcomings when dealing with the following three challenges:

1. **Complexity of HTML Text Data**: Web pages contain a large amount of lengthy and structurally complex HTML code, making it difficult for LLMs to effectively understand and manipulate web content.
2. **Diversity of Actions on Web Pages**: There is a wide variety of interactive actions on web pages, including clicking, scrolling, and inputting, which existing agents struggle to comprehensively cover.
3. **Difficulty of Open-Domain Tasks**: The openness and diversity of the internet make task completion more challenging, and existing agents lack the ability to perform correct reasoning and self-checking in open-domain environments.

To address these challenges, the authors developed AutoWebGLM, an automatic web navigation agent based on ChatGLM3-6B. By designing an HTML simplification algorithm, constructing a hybrid human-machine dataset, and employing methods such as reinforcement learning and rejection sampling fine-tuning, AutoWebGLM is able to perform excellently in various web navigation tasks, even surpassing GPT-4. Additionally, the authors created a bilingual benchmark dataset, AutoWebBench, to evaluate the agent's performance in real-world environments.
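
To make the idea of HTML simplification more concrete, here is a minimal sketch using only Python's standard `html.parser` module. It is not the authors' algorithm, whose details this summary does not give. It only illustrates the general approach of dropping scripts, styles, and layout-only markup while keeping visible text and interactive elements, each tagged with an index an agent could refer to. The names `PageSimplifier` and `simplify_html`, and the tag and attribute lists, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch; none of these lists come from the paper.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
VOID_TAGS = {"input"}  # interactive tags that never have a closing tag
SKIPPED_TAGS = {"script", "style", "noscript"}
KEPT_ATTRS = {"href", "type", "placeholder", "aria-label", "value"}


class PageSimplifier(HTMLParser):
    """Collect visible text and interactive elements, ignoring everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []       # ordered records: plain text or interactive elements
        self._skip_depth = 0  # > 0 while inside a skipped tag such as <script>
        self._current = None  # interactive element currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in INTERACTIVE_TAGS:
            return
        kept = [(k, v) for k, v in attrs if k in KEPT_ATTRS and v is not None]
        record = {"tag": tag, "attrs": kept, "text": []}
        self.items.append(record)
        if tag not in VOID_TAGS:
            self._current = record

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif self._current is not None and tag == self._current["tag"]:
            self._current = None

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        if self._current is not None:
            self._current["text"].append(text)  # label of the open element
        else:
            self.items.append({"tag": "text", "attrs": [], "text": [text]})


def simplify_html(html: str) -> str:
    """Return a compact, line-per-element view of the page."""
    parser = PageSimplifier()
    parser.feed(html)
    parser.close()
    lines, next_id = [], 0
    for item in parser.items:
        label = " ".join(item["text"])
        if item["tag"] == "text":
            lines.append(f"text: {label}")
            continue
        attrs = "".join(f' {k}="{v}"' for k, v in item["attrs"])
        lines.append(f"[{next_id}] <{item['tag']}{attrs}> {label}".rstrip())
        next_id += 1
    return "\n".join(lines)


sample = '<div class="nav"><h1>Shop</h1><input type="text" placeholder="Search"><button class="btn">Go</button></div>'
print(simplify_html(sample))
```

The indexed lines give a language model a much shorter input than raw HTML and a stable handle (the bracketed number) for naming the element to click or type into. A real system would also need to handle nested interactive elements, element visibility, and viewport position, which this sketch ignores.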