Abstract: Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete a task triggered by natural language by predicting a sequence of API actions. Even though the task relies heavily on past actions and visual observations, existing studies typically make little use of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of the previous actions, the current screen, and, more importantly, the action thinking about what actions should be performed and the outcomes led to by the chosen action. We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 1B model (i.e., AUTO-UI-base) on our AitZ dataset achieves on-par performance with CogAgent-Chat-18B.

What problem does this paper attempt to address?

The paper addresses how to predict the sequence of API actions more accurately when an agent autonomously completes a natural-language task on a smartphone graphical user interface (GUI). Such tasks depend heavily on past actions and visual observations, yet existing research largely ignores the semantic information carried by intermediate screenshots and screen operations. To address this, the paper proposes **Chain-of-Action-Thought (CoAT)**, which considers not only the descriptions of past actions and the current screen but also, more importantly, reasoning about which actions should be performed and what results they lead to.

### Core contributions of the paper:

1. **Proposing the CoAT model**:
   - **Screen Description (SD)**: Describes the main content of a given screenshot, including the screen type and the primary application or widget displayed.
   - **Action Think (AT)**: Analyzes the user query and the current screen, combined with historical information, to infer which actions would help achieve the goal.
   - **Next Action Description (AD)**: Describes the UI element or screen function that is about to be operated.
   - **Action Result (AR)**: Summarizes the result of an action by comparing the screenshots before and after it, linking the current screen to future observations.
2. **Constructing the Android-In-The-Zoo (AitZ) dataset**:
   - AitZ is described as the first and largest fine-grained Android GUI navigation dataset. It contains 2,504 unique instructions and 18,643 screen-action pairs with four types of semantic annotations, covering more than 70 Android applications.
3. **Verifying the effectiveness of CoAT**:
   - Zero-shot and fine-tuning evaluations verify the effectiveness and necessity of CoAT in improving action prediction accuracy.
   - The experimental results show that CoAT significantly improves the goal progress and learning efficiency of the GUI agent.

### Specific problem analysis:

- **Zero-shot evaluation**:
  - In zero-shot evaluation with the CogAgent model, CoAT significantly improves overall model performance.
  - The gain in action type prediction accuracy is most pronounced for the more complex click (CLICK) and input (TYPE) actions.
- **Fine-tuning evaluation**:
  - Ablation studies that introduce the CoAT components one at a time show that combining the Previous Action Result with Action Think and Next Action Description significantly improves action prediction accuracy.
  - Adding Screen Description slightly lowers the action matching score but still improves overall model performance.
- **Qualitative analysis**:
  - A detailed analysis of error cases shows that CoAT makes the decision-making process more consistent by explicitly describing the result of the previous action.
  - For the CogAgent model, adding Action Think to the model input helps reduce repeated and invalid actions.

In conclusion, by proposing the CoAT model and constructing the AitZ dataset, the paper significantly improves the action prediction ability and decision-making efficiency of GUI agents that complete tasks autonomously.
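The four CoAT components described above can be pictured as fields attached to each screen-action pair, with the history of previous actions and their results folded into the context for the next prediction. The following is a minimal sketch under stated assumptions, not the paper's implementation: the class name `CoATStep`, the function `build_coat_context`, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CoATStep:
    """One screen-action pair with CoAT-style annotations (hypothetical field names)."""
    screen_description: str               # SD: what the screenshot shows
    action_think: str                     # AT: reasoning about which action helps the goal
    next_action_description: str          # AD: the UI element or function to operate
    action_result: Optional[str] = None   # AR: known only after the action is executed


def build_coat_context(instruction: str,
                       history: List[CoATStep],
                       current_screen: str) -> str:
    """Assemble a text context from the instruction, past steps, and current screen.

    The layout of the text is an assumption for illustration; the paper's
    actual prompt format is not reproduced here.
    """
    lines = [f"Instruction: {instruction}"]
    for i, step in enumerate(history, start=1):
        lines.append(f"Step {i} action: {step.next_action_description}")
        if step.action_result:
            # Stating the previous result explicitly links past and current screens.
            lines.append(f"Step {i} result: {step.action_result}")
    lines.append(f"Current screen: {current_screen}")
    lines.append("Think about which action helps reach the goal, "
                 "then describe the next action.")
    return "\n".join(lines)


if __name__ == "__main__":
    past = [CoATStep(
        screen_description="Home screen with app icons",
        action_think="Wi-Fi is configured in Settings, so open Settings first.",
        next_action_description="click on the Settings icon",
        action_result="The Settings app opened",
    )]
    print(build_coat_context("Turn on Wi-Fi", past, "Settings main page"))
```

Keeping the action result optional reflects that it can only be written after the next screenshot is observed, so the most recent step in a live episode may not have one yet.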

Android in the Zoo: Chain-of-Action-Thought for GUI Agents
