CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only

Junhee Cho, Jihoon Kim, Daseul Bae, Jinho Choo, Youngjune Gwon, Yeong-Dae Kwon
2024-10-18
Abstract: Software robots have long been used in Robotic Process Automation (RPA) to automate mundane and repetitive computer tasks. With the advent of Large Language Models (LLMs) and their advanced reasoning capabilities, these agents are now able to handle more complex or previously unseen tasks. However, LLM-based automation techniques in recent literature frequently rely on HTML source code for input or application-specific API calls for actions, limiting their applicability to specific environments. We propose an LLM-based agent that mimics human behavior in solving computer tasks. It perceives its environment solely through screenshot images, which are then converted into text for an LLM to process. By leveraging the reasoning capability of the LLM, we eliminate the need for large-scale human demonstration data typically required for model training. The agent only executes keyboard and mouse operations on the Graphical User Interface (GUI), removing the need for pre-provided APIs to function. To further enhance the agent's performance in this setting, we propose a novel prompting strategy called Context-Aware Action Planning (CAAP) prompting, which enables the agent to thoroughly examine the task context from multiple perspectives. Our agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop, outperforming all previous studies of agents that rely solely on screen images. This method demonstrates potential for broader applications, particularly for tasks requiring coordination across multiple applications on desktops or smartphones, marking a significant advancement in the field of automation agents. Code and models are available at https://github.com/caap-agent/caap-agent.
Artificial Intelligence, Human-Computer Interaction
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses several key challenges in computer task automation. It proposes an agent built on a large language model (LLM) that solves computer tasks using only the front-end user interface (UI). The main issues are:

1. **Dependency on Specific Environments**: Existing automation techniques often rely on HTML source code or application-specific API calls, which limits their applicability to particular environments.
2. **High Data Requirements**: Many existing methods require large amounts of expert demonstration data for training, which is time-consuming and costly to collect.
3. **Lack of Flexibility**: Traditional RPA tools use rule-based algorithms, which struggle with unforeseen situations and anomalies that are common in real desktop tasks.

To address these issues, the paper proposes an agent that perceives its environment only through screenshot images and acts only through keyboard and mouse operations. The paper also introduces a prompting strategy called "Context-Aware Action Planning (CAAP)" to improve the agent's performance on complex tasks.

### Main Contributions

1. **First LLM-based Agent Model**: The model uses the front-end UI as its only source of input and output. Converting screenshots to text and processing that text with an LLM eliminates the need for large-scale human demonstration data.
2. **CAAP Prompting Technique**: This technique improves the agent's decision-making by systematically organizing contextual information and using syntax structures that trigger effective chain-of-thought (CoT) reasoning.
3. **Modular Architecture**: The agent is divided into three modules: a visual observer, a decision-maker, and an action executor. Each module can be updated independently, which improves the system's scalability and parallel processing capabilities.
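
The three-module split described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the `UIElement` and `Action` schemas, the prompt layout, and the one-action-per-line reply format (`click X Y`, `type TEXT`) are all assumptions made for illustration, and the prompt shown is a plain context listing, not the CAAP prompt from the paper. The screen detector, the LLM, and the input device are injected as callables so each module can be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UIElement:
    """One on-screen element described as text (hypothetical schema)."""
    kind: str
    text: str
    x: int
    y: int


@dataclass
class Action:
    """A low-level keyboard or mouse operation (hypothetical schema)."""
    name: str   # "click" or "type"
    args: tuple


def parse_action(line: str) -> Action:
    """Parse one reply line in the assumed format 'click X Y' or 'type TEXT'."""
    name, _, rest = line.strip().partition(" ")
    if name == "click":
        x, y = rest.split()
        return Action("click", (int(x), int(y)))
    if name == "type":
        return Action("type", (rest,))
    raise ValueError(f"unsupported action: {line!r}")


class VisualObserver:
    """Converts a screenshot into a text description of the UI."""

    def __init__(self, detect: Callable[[bytes], List[UIElement]]):
        self.detect = detect  # a real system would plug a vision model in here

    def describe(self, screenshot: bytes) -> str:
        return "\n".join(
            f"{i}: {e.kind} '{e.text}' at ({e.x}, {e.y})"
            for i, e in enumerate(self.detect(screenshot))
        )


class DecisionMaker:
    """Builds a prompt from the task context and asks an LLM for actions."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.history: List[str] = []  # raw action lines from earlier steps

    def build_prompt(self, task: str, screen_text: str) -> str:
        past = "\n".join(self.history) or "(none)"
        return (
            f"Task:\n{task}\n\nCurrent screen:\n{screen_text}\n\n"
            f"Previous actions:\n{past}\n\n"
            "Reply with one action per line: 'click X Y' or 'type TEXT'."
        )

    def decide(self, task: str, screen_text: str) -> List[Action]:
        reply = self.llm(self.build_prompt(task, screen_text))
        lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
        actions = [parse_action(ln) for ln in lines]
        self.history.extend(lines)
        return actions


class ActionExecutor:
    """Performs actions through injected keyboard and mouse callables."""

    def __init__(self, click: Callable[[int, int], None],
                 type_text: Callable[[str], None]):
        self.click = click
        self.type_text = type_text

    def run(self, actions: List[Action]) -> None:
        for action in actions:
            if action.name == "click":
                self.click(*action.args)
            elif action.name == "type":
                self.type_text(*action.args)


def run_step(task: str, screenshot: bytes, observer: VisualObserver,
             decider: DecisionMaker, executor: ActionExecutor) -> List[Action]:
    """One observe, decide, act cycle of the agent loop."""
    screen_text = observer.describe(screenshot)
    actions = decider.decide(task, screen_text)
    executor.run(actions)
    return actions
```

Because every external dependency is passed in, the loop can be exercised with stub callables (a fixed element list, a canned LLM reply, and a list that records clicks and keystrokes) before any real vision model, LLM endpoint, or GUI automation library is attached.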
### Experimental Results

- **MiniWoB++ Benchmark**: The agent achieved an average success rate of 94.5% across 73 tasks, surpassing all previous methods that rely solely on image input.
- **WebShop Benchmark**: The agent achieved a task score of 62.3 on 500 test instructions, outperforming Pix2Act's score of 46.7.

These results indicate that the proposed method is effective and adaptable across a variety of computer tasks. The authors also see potential for tasks that require coordination across multiple applications on desktops or smartphones.