Abstract: Recently, large language models (LLMs) have demonstrated exceptional capabilities in serving as the foundation for AI assistants. One emerging application of LLMs, navigating through websites and interacting with UI elements across various web pages, remains somewhat underexplored. We introduce Steward, a novel LLM-powered web automation tool designed to serve as a cost-effective, scalable, end-to-end solution for automating web interactions. Traditional browser automation frameworks like Selenium, Puppeteer, and Playwright are not scalable for extensive web interaction tasks, such as studying recommendation algorithms on platforms like YouTube and Twitter. These frameworks require manual coding of interactions, limiting their utility in large-scale or dynamic contexts. Steward addresses these limitations by integrating LLM capabilities with browser automation, allowing for natural language-driven interaction with websites. Steward operates by receiving natural language instructions and reactively planning and executing a sequence of actions on websites, looping until completion, making it a practical tool for developers and researchers to use. It achieves high efficiency, completing actions in 8.52 to 10.14 seconds at a cost of $0.028 per action or an average of $0.18 per task, which is further reduced to 4.8 seconds and $0.022 through a caching mechanism. It runs tasks on real websites with a 40% completion success rate. We discuss various design and implementation challenges, including state representation, action sequence selection, system responsiveness, detecting task completion, and caching implementation.

What problem does this paper attempt to address?

This paper addresses the limitations of existing browser automation frameworks (such as Selenium, Puppeteer, and Playwright) on large-scale or dynamic web interaction tasks. These traditional frameworks require interactions with web page elements to be coded by hand, which limits their practicality for large-scale testing and in dynamic environments. They perform especially poorly when the content under study is dynamically generated and depends on location or context, as with recommendation algorithms.

To solve these problems, the paper proposes Steward, a web automation tool based on large language models (LLMs) that aims to provide a cost-effective and scalable end-to-end solution for automating web interactions. By integrating LLM capabilities with browser automation, Steward lets users interact with websites through natural-language instructions: it receives an instruction, then plans and executes a series of operations until the task is completed. This makes Steward a practical tool for developers and researchers, who can complete tasks efficiently while keeping costs low.

The main contributions of the paper are:

1. Designing a unique LLM-based web executor that can be easily integrated into browser automation frameworks. Steward is specifically designed to work with the Playwright framework, is fully autonomous, and only requires users to provide high-level goals or tasks in natural language.
2. Developing a context-aware, website/application-independent UI interaction system that can automate web interactions at scale. Steward can generalize its knowledge to navigate and interact with a variety of websites. Within its top five candidate elements, Steward achieves 81.44% action + element selection accuracy without any training or fine-tuning.
3. Conducting an in-depth evaluation of Steward's running time and cost. The system design is optimized for runtime and cost efficiency, achieving a median running time of 8.52 or 10.14 seconds at a cost of $0.028 per operation. In addition, a caching mechanism stores and reuses website interactions, reducing the running time and cost per step by 43.7% and 53.6%, respectively.

Through these contributions, Steward aims to overcome the limitations of existing automation tools and provide a more flexible, reliable, and efficient solution for a wide range of web automation tasks.
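The plan-and-execute loop with cached interactions described above can be illustrated with a minimal Python sketch. This is not Steward's implementation: `Action`, `FakePage`, `scripted_planner`, and `run_task` are hypothetical names, the planner is a scripted stand-in for an LLM call, and the page is an in-memory stub rather than a real Playwright page. The sketch only shows the control flow: look up the current state in a cache, call the planner on a miss, and stop when the planner reports completion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Action:
    kind: str          # e.g. "click", "type", or "done" (hypothetical action set)
    target: str = ""   # hypothetical element identifier
    text: str = ""     # text to type, if any


class FakePage:
    """In-memory stand-in for a browser page (assumption: a real setup
    would wrap a browser automation page object instead)."""

    def __init__(self) -> None:
        self.url = "https://example.com/"
        self.log: List[Action] = []

    def state(self) -> str:
        # Compact state description used as planner input and cache key.
        return f"{self.url}|steps={len(self.log)}"

    def perform(self, action: Action) -> None:
        self.log.append(action)


Planner = Callable[[str, str], Action]
Cache = Dict[Tuple[str, str], Action]


def run_task(goal: str, page: FakePage, planner: Planner,
             cache: Cache, max_steps: int = 10) -> Tuple[bool, int]:
    """Plan and execute actions until the planner signals completion.

    Returns (completed, number_of_planner_calls)."""
    planner_calls = 0
    for _ in range(max_steps):
        key = (goal, page.state())
        action = cache.get(key)
        if action is None:
            # Cache miss: ask the (expensive) planner and remember the answer.
            action = planner(goal, page.state())
            planner_calls += 1
            cache[key] = action
        if action.kind == "done":
            return True, planner_calls
        page.perform(action)
    return False, planner_calls


def scripted_planner(goal: str, state: str) -> Action:
    """Toy planner standing in for an LLM call: follows a fixed script."""
    step = int(state.rsplit("=", 1)[1])
    script = [Action("click", "search-box"), Action("type", "search-box", goal)]
    return script[step] if step < len(script) else Action("done")


if __name__ == "__main__":
    cache: Cache = {}
    print(run_task("find cat videos", FakePage(), scripted_planner, cache))
    print(run_task("find cat videos", FakePage(), scripted_planner, cache))
```

In this toy setup, a second run of the same goal against the same page states is served entirely from the cache and makes no planner calls. That is the intuition behind reusing stored interactions to cut per-step time and cost, although the paper's actual cache design and state representation may differ.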

Steward: Natural Language Web Automation

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

SteP: Stacked LLM Policies for Web Actions

PAFFA: Premeditated Actions For Fast Agents

ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

AutoDroid: LLM-powered Task Automation in Android

Empowering LLM to use Smartphone for Intelligent Task Automation

Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models

WebArena: A Realistic Web Environment for Building Autonomous Agents

Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Glider: A Reinforcement Learning Approach to Extract UI Scripts from Websites

CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only

TaskBench: Benchmarking Large Language Models for Task Automation

Grounding Open-Domain Instructions to Automate Web Support Tasks

Intelligent Virtual Assistants with LLM-based Process Automation

WebRobot: Web Robotic Process Automation using Interactive Programming-by-Demonstration