AppAgent: Multimodal Agents as Smartphone Users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu
2023-12-22
Abstract: Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to address is the development of a multimodal intelligent agent framework capable of operating smartphone applications like a human user. Specifically, the researchers aim to overcome the limitations of existing text-based large language models (LLMs) in interacting with the environment by integrating visual capabilities, enabling the intelligent agent to understand and operate various smartphone applications without relying on backend system access.

### Main Issues:

1. **Limitations of Existing LLMs**: Current large language models primarily rely on text information, which limits their ability to perceive and interact with the environment. For example, they cannot understand the icons and operational logic in graphical user interfaces (GUIs).
2. **Challenges in Adapting to New Applications**: Different applications have varying GUIs and are frequently updated, necessitating a method for the intelligent agent to quickly adapt to new or unseen applications.
3. **Difficulty in Data Collection**: Training intelligent agents to perform complex tasks requires a large amount of application demonstration data, but collecting this data is a significant challenge.

### Solutions:

1. **Multimodal Agent Framework**: This framework combines large language models with visual processing capabilities, enabling the intelligent agent to interact with smartphone applications through low-level operations such as clicking and swiping.
2. **Exploration Phase**: The intelligent agent learns the functions and operational logic of applications through autonomous exploration or by observing human demonstrations. These interactions are recorded to form a knowledge base for the agent to reference when performing tasks.
3. **Deployment Phase**: After completing the exploration phase, the intelligent agent utilizes the accumulated knowledge base to perform complex tasks based on the current UI state and task requirements.
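
The two phases described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (`Tap`, `Swipe`, `KnowledgeBase`), the `build_prompt` helper, the element identifiers, and the note format are all hypothetical, chosen only to show how a small action space and a knowledge base recorded during exploration might feed into deployment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Tap:
    element_id: str  # hypothetical identifier for an on-screen UI element


@dataclass(frozen=True)
class Swipe:
    element_id: str
    direction: str  # "up", "down", "left", or "right"


# The simplified action space: only human-like, low-level gestures.
Action = Union[Tap, Swipe]


@dataclass
class KnowledgeBase:
    """Maps UI element ids to notes on what interacting with them does."""
    notes: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, action: Action, observed_effect: str) -> None:
        # Exploration phase: store the effect observed after an action.
        verb = "tap" if isinstance(action, Tap) else f"swipe {action.direction}"
        self.notes.setdefault(action.element_id, []).append(
            f"{verb}: {observed_effect}"
        )

    def lookup(self, element_id: str) -> List[str]:
        return self.notes.get(element_id, [])


def build_prompt(task: str, visible_elements: List[str], kb: KnowledgeBase) -> str:
    # Deployment phase: combine the task, the current UI state, and any
    # recorded notes into text that a multimodal LLM could be given.
    lines = [f"Task: {task}", "Visible elements:"]
    for element_id in visible_elements:
        known = "; ".join(kb.lookup(element_id)) or "no notes yet"
        lines.append(f"- {element_id}: {known}")
    return "\n".join(lines)


# Exploration: record what each interaction did (effects are made up here).
kb = KnowledgeBase()
kb.record(Tap("compose_button"), "opens a new message draft")
kb.record(Swipe("inbox_list", "up"), "scrolls to older messages")

# Deployment: reference the knowledge base for the elements now on screen.
prompt = build_prompt("Send an email", ["compose_button", "search_icon"], kb)
print(prompt)
```

Keeping the knowledge base keyed by UI element means the agent only needs notes for elements visible in the current screen, which is one plausible way to keep the prompt short as more apps are explored.
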
### Experimental Validation:

The researchers tested 50 tasks across 10 different applications, including social media, email, maps, shopping, and complex image editing tools. The experimental results demonstrate that the framework performs well across various applications, efficiently completing a range of advanced tasks.

### Main Contributions:

1. **Open-Source Multimodal Agent Framework**: Provides a multimodal agent framework capable of operating smartphone applications and releases the source code.
2. **Innovative Exploration Strategy**: Proposes a method for learning application functions through autonomous exploration and observing human demonstrations.
3. **Extensive Experimental Validation**: Validates the framework's effectiveness and adaptability through numerous experiments, showcasing its potential in real-world applications.

In summary, this paper aims to address the limitations of existing LLMs in operating smartphone applications by developing a multimodal intelligent agent framework, enabling it to efficiently complete various tasks like a human user.