Abstract: This paper introduces a novel mobile phone control architecture, termed "app agents", for efficient interactions and controls across various Android apps. The proposed Lightweight Multi-modal App Control (LiMAC) takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, within LiMAC, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
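
The inputs and outputs named in the abstract (a textual goal, a sequence of past observations made of screenshots and UI trees, and a generated action) can be pictured as a small data model. The following is a minimal sketch only: every class and field name here (`UIElement`, `Observation`, `Action`, `Episode`, `screenshot_path`, and so on) is a hypothetical illustration and is not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UIElement:
    """One node of a UI tree (fields are illustrative assumptions)."""
    index: int
    text: str
    clickable: bool


@dataclass
class Observation:
    """A single phone state: a screenshot plus its corresponding UI tree."""
    screenshot_path: str
    ui_tree: List[UIElement]


@dataclass
class Action:
    """An action produced by the agent (action names are assumptions)."""
    action_type: str                     # e.g. "click", "scroll", "input-text"
    target_index: Optional[int] = None   # UI element index, used by clicks
    text: Optional[str] = None           # payload, used by text actions


@dataclass
class Episode:
    """A textual goal together with the history the agent conditions on."""
    goal: str
    observations: List[Observation] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)

    def record(self, observation: Observation, action: Action) -> None:
        """Append one (observation, action) step to the history."""
        self.observations.append(observation)
        self.actions.append(action)
```

For example, `Episode(goal="Send a message")` starts with an empty history, and each call to `record` adds one observed state and the action taken in it.
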

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses the efficiency and resource limitations that smartphone application agents (app agents) face when executing user instructions. Specifically, the authors propose a new mobile device control architecture, **Lightweight Multi-modal App Control (LiMAC)**, to achieve efficient interaction and control across multiple Android applications.

#### Main problems:

1. **Limited computing resources**: Smartphones have limited computing power and memory, so existing application agents based on large-scale foundation models (such as GPT-4o) execute tasks slowly and at high cost.
2. **Requirement for real-time decision-making**: Real-time task execution requires a system that can make quick decisions under limited resources.
3. **Complex task processing**: In addition to simple click and scroll operations, some tasks also require natural language understanding and text generation capabilities, such as sending messages or querying search engines.

#### Solutions:

To solve the above problems, the authors propose the following methods:

1. **Lightweight Transformer network (Action Transformer, AcT)**: It predicts action types and handles most tasks that do not require complex natural language understanding. AcT predicts the target UI element of a click operation through a contrastive learning objective.
2. **Fine-tuned Vision-Language Model (VLM)**: Tasks that require natural language understanding and text generation (such as inputting text or opening applications) are handled by the fine-tuned VLM. This keeps the system lightweight while still allowing it to handle complex text tasks.
3. **Hybrid architecture**: It combines the advantages of the lightweight Transformer and the fine-tuned VLM to achieve efficient real-time decision-making and task execution.

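
The division of labour described in the solutions above can be sketched as a small dispatch routine. This is a sketch under stated assumptions, not the paper's implementation: the action-type names, the `decide` and `pick_click_target` functions, and the use of cosine similarity to score UI elements are all hypothetical. In the described system the action type and the embeddings would come from AcT, and the text from the fine-tuned VLM; here they are plain inputs and a caller-supplied callback.

```python
import math
from typing import Callable, Dict, List, Sequence

# Assumed names for the action types that need text generation.
TEXT_ACTIONS = {"input-text", "open-app"}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_click_target(query: Sequence[float],
                      element_embeddings: List[Sequence[float]]) -> int:
    """Return the index of the UI element most similar to the query.

    A contrastive objective trains embeddings so that the correct element
    scores highest; at inference this reduces to an argmax over scores.
    """
    if not element_embeddings:
        raise ValueError("no UI elements to choose from")
    scores = [cosine(query, e) for e in element_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def decide(action_type: str,
           query: Sequence[float],
           element_embeddings: List[Sequence[float]],
           generate_text: Callable[[str], str]) -> Dict[str, object]:
    """Route one step: text actions go to the VLM callback, clicks to the
    similarity-based selector, and everything else needs no extra model."""
    if action_type in TEXT_ACTIONS:
        return {"type": action_type, "text": generate_text(action_type)}
    if action_type == "click":
        target = pick_click_target(query, element_embeddings)
        return {"type": "click", "target": target}
    return {"type": action_type}
```

The point of the routing is that the expensive text-generation callback runs only for the minority of steps that need it, while clicks and other simple actions are resolved by the small model alone.
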
#### Experimental results:

- **Performance improvement**: Compared with existing methods based on large-scale foundation models, LiMAC improves both task execution time and accuracy on two open-source mobile control datasets. Specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs and by up to 42% compared to prompt-engineering baselines.
- **Speed advantage**: LiMAC executes tasks 30 times faster than existing methods, averaging only 3 seconds per task.

Through these improvements, LiMAC not only improves the accuracy and speed of task execution but also significantly reduces the demand for computing resources, making it more suitable for deployment on mobile devices such as smartphones.

Lightweight Neural App Control
