Abstract: Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.

What problem does this paper attempt to address?

This paper addresses how to build a multimodal intelligent agent framework that can operate smartphone applications the way a human user does. Specifically, the researchers aim to overcome the limited ability of text-based large language models (LLMs) to interact with their environment. They do so by adding visual capabilities, so that the agent can understand and operate a variety of smartphone applications without relying on back-end system access.

### Main Issues:

1. **Limitations of Existing LLMs**: Current large language models rely primarily on text, which limits their ability to perceive and interact with the environment. For example, they cannot interpret the icons and operational logic of graphical user interfaces (GUIs).
2. **Challenges in Adapting to New Applications**: Applications differ in their GUIs and are updated frequently, so the agent needs a way to adapt quickly to new or unseen applications.
3. **Difficulty in Data Collection**: Training agents to perform complex tasks requires a large amount of application demonstration data, and collecting that data is a significant challenge.

### Solutions:

1. **Multimodal Agent Framework**: The framework combines a large language model with visual processing, so the agent can interact with smartphone applications through low-level operations such as tapping and swiping.
2. **Exploration Phase**: The agent learns the functions and operational logic of an application through autonomous exploration or by observing human demonstrations. These interactions are recorded in a knowledge base that the agent consults when performing tasks.
3. **Deployment Phase**: After the exploration phase, the agent uses the accumulated knowledge base to carry out complex tasks based on the current UI state and the task requirements.
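The two phases described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class names (`Tap`, `Swipe`, `KnowledgeBase`), the element identifiers, and the prompt layout are all hypothetical, and the screenshot input to the multimodal model is omitted. The sketch only shows how notes recorded during exploration might be looked up for the elements visible on the current screen at deployment time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Tap:
    element_id: str  # hypothetical identifier of an on-screen UI element


@dataclass(frozen=True)
class Swipe:
    element_id: str
    direction: str  # assumed values: "up", "down", "left", "right"


# The simplified action space: only human-like gestures, no back-end calls.
Action = Union[Tap, Swipe]


@dataclass
class KnowledgeBase:
    """Maps a UI element to notes about what interacting with it does."""
    notes: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, action: Action, observed_effect: str) -> None:
        # Exploration phase: store the effect observed after an action.
        kind = type(action).__name__.lower()
        self.notes.setdefault(action.element_id, []).append(
            f"{kind}: {observed_effect}"
        )

    def lookup(self, visible_elements: List[str]) -> Dict[str, List[str]]:
        # Deployment phase: return notes only for elements on the current screen.
        return {e: self.notes[e] for e in visible_elements if e in self.notes}


def build_prompt(task: str, visible_elements: List[str], kb: KnowledgeBase) -> str:
    """Assemble the text part of a prompt (hypothetical layout)."""
    lines = [f"Task: {task}", "Known element functions:"]
    for element, element_notes in kb.lookup(visible_elements).items():
        for note in element_notes:
            lines.append(f"- {element} -> {note}")
    lines.append("Choose one action: tap(element) or swipe(element, direction).")
    return "\n".join(lines)


if __name__ == "__main__":
    kb = KnowledgeBase()
    kb.record(Tap("compose_button"), "opens a new email draft")
    kb.record(Swipe("inbox_list", "up"), "scrolls to older messages")
    print(build_prompt("Send an email", ["compose_button", "send_button"], kb))
```

Keying the notes by UI element keeps the deployment-time prompt small, because only notes for elements on the current screen are included. Whether the original framework organizes its knowledge base this way is an assumption of this sketch.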
### Experimental Validation:

The researchers tested 50 tasks across 10 different applications, including social media, email, maps, shopping, and complex image editing tools. The results indicate that the framework handles a diverse range of high-level tasks across these applications.

### Main Contributions:

1. **Open-Source Multimodal Agent Framework**: Provides a multimodal agent framework capable of operating smartphone applications and releases the source code.
2. **Innovative Exploration Strategy**: Proposes a method for learning application functions through autonomous exploration and by observing human demonstrations.
3. **Extensive Experimental Validation**: Validates the framework's effectiveness and adaptability on 50 tasks across 10 applications, showcasing its potential in real-world use.

In summary, this paper addresses the limitations of existing LLMs in operating smartphone applications by developing a multimodal intelligent agent framework that completes a variety of tasks the way a human user would.

AppAgent: Multimodal Agents as Smartphone Users

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

MobA: A Two-Level Agent System for Efficient Mobile Task Automation

MobileAgent: enhancing mobile control via human-machine interaction and SOP integration

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents

CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

Comprehensive Cognitive LLM Agent for Smartphone GUI Automation

MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

Empowering LLM to use Smartphone for Intelligent Task Automation

Exploring Smart Agents for the Interaction with Multimodal Mediated Environments

Lightweight Neural App Control

Large Language Model-Brained GUI Agents: A Survey

Poster: Enabling Agent-centric Interaction on Smartphones with LLM-based UI Reassembling

LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation

ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

Intelligent Virtual Assistants with LLM-based Process Automation