Lightweight Neural App Control

Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
2024-10-23
Abstract: This paper introduces a novel mobile phone control architecture, termed "app agents", for efficient interactions and controls across various Android apps. The proposed Lightweight Multi-modal App Control (LiMAC) takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, within LiMAC, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
Artificial Intelligence
### What problems does this paper attempt to solve?

This paper addresses the efficiency and resource limitations that smartphone application agents (app agents) face when executing user instructions. Specifically, the authors propose a new mobile control architecture, **Lightweight Multi-modal App Control (LiMAC)**, for efficient interaction and control across multiple Android applications.

#### Main problems:

1. **Limited computing resources**: Smartphones have limited computing power and memory, so existing app agents built on large-scale foundation models (such as GPT-4o) are slow and costly to run.
2. **Requirement for real-time decision-making**: Executing tasks in real time requires a system that can make quick decisions under limited resources.
3. **Complex task processing**: Beyond simple click and scroll operations, some tasks also require natural language understanding and text generation, such as sending messages or querying search engines.

#### Solutions:

To solve the above problems, the authors propose the following:

1. **Lightweight Transformer network (Action Transformer, AcT)**: AcT predicts the action type and handles most actions that do not require complex natural language understanding. For click actions, it predicts the target UI element through a contrastive learning objective.
2. **Fine-tuned Vision-Language Model (VLM)**: Actions that require natural language understanding and text generation (such as inputting text or opening applications) are handled by a fine-tuned VLM. This keeps the system lightweight while still letting it handle complex text tasks.
3. **Hybrid architecture**: LiMAC combines the lightweight Transformer and the fine-tuned VLM to achieve efficient real-time decision-making and task execution.
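
The division of labour described under Solutions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the action-type labels, the `Action` container, the `decide` and `pick_click_target` functions, and the use of cosine similarity to score UI elements are all assumptions made for illustration. The sketch only shows the routing idea: a small model supplies the action type, text-producing actions are delegated to a language model, and click targets are chosen by comparing a query embedding against the embeddings of the UI elements.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical labels for actions that need generated text (assumption).
TEXT_ACTIONS = {"input-text", "open-app"}


@dataclass
class Action:
    type: str
    target_index: Optional[int] = None  # UI element index, for clicks
    text: Optional[str] = None          # generated text, for text actions


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_click_target(query: Sequence[float],
                      elements: Sequence[Sequence[float]]) -> int:
    """Return the index of the UI element most similar to the query."""
    scores = [cosine(query, e) for e in elements]
    return max(range(len(scores)), key=scores.__getitem__)


def decide(action_type: str,
           query: Sequence[float],
           elements: Sequence[Sequence[float]],
           generate_text: Callable[[str], str]) -> Action:
    """Route one step: `action_type` stands in for the small model's
    prediction, `generate_text` stands in for the fine-tuned VLM."""
    if action_type in TEXT_ACTIONS:
        return Action(action_type, text=generate_text(action_type))
    if action_type == "click":
        return Action(action_type,
                      target_index=pick_click_target(query, elements))
    return Action(action_type)  # e.g. scrolling needs no extra fields


# Usage with toy 2-D embeddings and a stub text generator.
ui_elements = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
click = decide("click", [0.1, 0.9], ui_elements, lambda kind: "hello")
typed = decide("input-text", [0.1, 0.9], ui_elements, lambda kind: "hello")
```

In this toy run, `click.target_index` selects the second element because its embedding points in almost the same direction as the query, while `typed` never touches the embeddings and simply carries the stub generator's text. The expensive text generator is only called for the action types that need it, which is the property the hybrid design relies on.
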
#### Experimental results:

- **Performance improvement**: On two open-source mobile control datasets, LiMAC improves both task execution time and accuracy over existing methods based on large-scale foundation models. Specifically, it increases overall action accuracy by up to 19% compared to fine-tuned VLMs and by up to 42% compared to prompt-engineering baselines.
- **Speed advantage**: LiMAC executes tasks 30 times faster than existing methods, averaging only 3 seconds per task.

Through these improvements, LiMAC raises the accuracy and speed of task execution while significantly reducing the demand for computing resources, making it better suited to deployment on mobile devices such as smartphones.