Abstract: Software robots have long been used in Robotic Process Automation (RPA) to automate mundane and repetitive computer tasks. With the advent of Large Language Models (LLMs) and their advanced reasoning capabilities, these agents are now able to handle more complex or previously unseen tasks. However, LLM-based automation techniques in recent literature frequently rely on HTML source code for input or application-specific API calls for actions, limiting their applicability to specific environments. We propose an LLM-based agent that mimics human behavior in solving computer tasks. It perceives its environment solely through screenshot images, which are then converted into text for an LLM to process. By leveraging the reasoning capability of the LLM, we eliminate the need for large-scale human demonstration data typically required for model training. The agent executes only keyboard and mouse operations on the Graphical User Interface (GUI), removing the need for pre-provided APIs. To further enhance the agent's performance in this setting, we propose a novel prompting strategy called Context-Aware Action Planning (CAAP) prompting, which enables the agent to thoroughly examine the task context from multiple perspectives. Our agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop, outperforming all previous studies of agents that rely solely on screen images. This method demonstrates potential for broader applications, particularly for tasks requiring coordination across multiple applications on desktops or smartphones, marking a significant advancement in the field of automation agents. Code and models are available at https://github.com/caap-agent/caap-agent.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses key challenges in computer task automation. It proposes an agent based on a large language model (LLM) that solves computer tasks through the front-end user interface (UI) alone. The main issues are:

1. **Dependency on Specific Environments**: Existing automation techniques often rely on HTML source code or application-specific API calls, which limits their applicability to particular environments.
2. **High Data Requirements**: Many existing methods require large amounts of expert demonstration data for training, which is both time-consuming and costly to collect.
3. **Lack of Flexibility**: Traditional RPA tools use rule-based algorithms, which struggle with the unforeseen situations and anomalies that are common in real desktop tasks.

To address these issues, the paper proposes an agent that perceives the environment only through screenshot images and acts only through keyboard and mouse operations. The paper also introduces a prompting strategy called "Context-Aware Action Planning (CAAP)" to improve the agent's performance on complex tasks.

### Main Contributions

1. **First LLM-based Agent Model**: The model uses the front-end UI as its only input and output channel. Screenshots are converted to text and processed by an LLM, which removes the need for large-scale human demonstration data.
2. **CAAP Prompting Technique**: This technique improves the agent's decision-making by systematically organizing contextual information and using syntactic patterns that elicit effective chain-of-thought (CoT) reasoning.
3. **Modular Architecture**: The agent is divided into three modules: visual observer, decision-maker, and action executor. Each module can be updated independently, improving the system's scalability and parallel processing capabilities.
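The three-module split described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `UIElement` and `Action` schemas, the prompt layout, the `click <index>` / `type <index> <text>` reply format, and the stub LLM are all assumptions made for illustration. The executor only records operations instead of driving a real mouse and keyboard.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UIElement:
    """One on-screen element as a visual observer might report it (hypothetical schema)."""
    kind: str  # e.g. "button" or "input"
    text: str
    x: int
    y: int


@dataclass
class Action:
    """A low-level keyboard or mouse operation (hypothetical schema)."""
    kind: str  # "click" or "type"
    x: int = 0
    y: int = 0
    text: str = ""


def describe_screen(elements: List[UIElement]) -> str:
    """Visual observer output: render detected elements as text for the LLM."""
    return "\n".join(
        f"[{i}] {e.kind} '{e.text}' at ({e.x}, {e.y})" for i, e in enumerate(elements)
    )


def build_prompt(task: str, screen_text: str, history: List[Action]) -> str:
    """Organize task, screen description, and action history into one prompt."""
    past = "\n".join(f"- {a.kind} {a.text or (a.x, a.y)}" for a in history) or "- none"
    return (
        f"Task: {task}\n\nScreen:\n{screen_text}\n\n"
        f"Previous actions:\n{past}\n\n"
        "Think step by step, then give the next action on the last line."
    )


def decide(llm: Callable[[str], str], task: str,
           elements: List[UIElement], history: List[Action]) -> Action:
    """Decision-maker: query the LLM and parse the last line of its reply."""
    prompt = build_prompt(task, describe_screen(elements), history)
    parts = llm(prompt).strip().splitlines()[-1].split(maxsplit=2)
    target = elements[int(parts[1])]
    if parts[0] == "click":
        return Action("click", target.x, target.y)
    if parts[0] == "type":
        return Action("type", target.x, target.y, parts[2])
    raise ValueError(f"unsupported action: {parts[0]}")


def execute(action: Action, log: List[str]) -> None:
    """Action executor stand-in: record the operation rather than perform it."""
    if action.kind == "click":
        log.append(f"mouse click at ({action.x}, {action.y})")
    else:
        log.append(f"click ({action.x}, {action.y}) then type {action.text!r}")


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; always picks element 1."""
    return "The Submit button matches the task.\nclick 1"


elements = [UIElement("input", "Name", 40, 60), UIElement("button", "Submit", 40, 120)]
log: List[str] = []
action = decide(stub_llm, "Press the Submit button", elements, [])
execute(action, log)
print(log)
```

Because each stage only exchanges plain data (`UIElement` lists, prompt strings, `Action` objects), any one of them could be swapped out, for example replacing `stub_llm` with a real model call, without touching the other two.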
### Experimental Results

- **MiniWoB++ Benchmark**: The agent achieved an average success rate of 94.5% across 73 tasks, surpassing all existing methods that rely solely on image input.
- **WebShop Benchmark**: The agent achieved a task score of 62.3 on 500 test instructions, outperforming Pix2Act's score of 46.7.

These results suggest that the proposed method is effective and adaptable across a variety of computer tasks. The authors also see potential for tasks that require coordination across multiple applications.
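As a rough illustration of how a figure such as the MiniWoB++ average might be aggregated, the sketch below takes an unweighted mean of per-task success rates. The summary does not spell out the aggregation, so this is an assumption, and the episode counts are hypothetical placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, Tuple

# Hypothetical (successes, episodes) per task; NOT numbers from the paper.
episodes: Dict[str, Tuple[int, int]] = {
    "click-button": (50, 50),
    "enter-text": (45, 50),
    "use-autocomplete": (40, 50),
}


def average_success_rate(results: Dict[str, Tuple[int, int]]) -> float:
    """Unweighted mean of per-task success rates, so each task counts equally."""
    return mean(successes / total for successes, total in results.values())


rate = average_success_rate(episodes)
print(f"{rate:.1%}")
```

Averaging per task rather than per episode keeps a task with many episodes from dominating the headline number.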

CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only

AutoAct: Automatic Agent Learning from Scratch for QA Via Self-Planning

Comprehensive Cognitive LLM Agent for Smartphone GUI Automation

You Only Look at Screens: Multimodal Chain-of-Action Agents

CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

MobileAgent: enhancing mobile control via human-machine interaction and SOP integration

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

MobA: A Two-Level Agent System for Efficient Mobile Task Automation

Dynamic Planning for LLM-based Graphical User Interface Automation

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Language Models can Solve Computer Tasks

ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation

AUTOACT: Automatic Agent Learning from Scratch via Self-Planning