Abstract: Addressing the challenge of a digital assistant capable of executing a wide array of user tasks, our research focuses on the realm of instruction-based mobile device control. We leverage recent advancements in large language models (LLMs) and present a visual language model (VLM) that can fulfill diverse tasks on mobile devices. Our model functions by interacting solely with the user interface (UI). It uses the visual input from the device screen and mimics human-like interactions, encompassing gestures such as tapping and swiping. This generality in the input and output space allows our agent to interact with any application on the device. Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots along with corresponding actions. Evaluating our method on the challenging Android in the Wild benchmark demonstrates its promising efficacy and potential.


### What problem does this paper attempt to solve?

This paper asks how to develop a digital assistant that can perform a wide range of user tasks, focusing on instruction-based mobile device control. Specifically, the research uses visual-language models (VLMs) to control mobile devices intelligently, so that the model completes tasks solely by interacting with the user interface (UI).

#### Main challenges:

1. **Understanding of natural language instructions**: Users traditionally operate a device by tapping and gesturing directly on the screen, but in many cases it is more natural and convenient to express a command in natural language.
2. **Generality across applications**: Many existing methods rely on API calls, but not all applications provide APIs, and integrating multiple APIs presents challenges in terms of training and context length.
3. **Processing of visual information**: The model must understand the visual information on the screen and combine it with the natural language instruction to generate the correct sequence of operations.

#### Solutions:

- **Visual-language model (VLM)**: Building on recent progress in large language models (LLMs), the paper proposes a visual-language model. Given visual input (such as screenshots) and a natural language instruction, the model imitates human interactions, including tapping, swiping, and other gestures.
- **Utilization of historical screenshots**: Unlike previous methods, the model does not depend on a single screenshot alone. It also uses a sequence of past screenshots and their corresponding actions, which gives it more context for deciding the next operation.
- **Android in the Wild benchmark**: Evaluation on the challenging Android in the Wild benchmark dataset demonstrates the method's promising effectiveness and potential.

### Summary:

The core problem of this paper is to develop an intelligent assistant that can understand natural language instructions and interact with mobile devices through the user interface, in particular by using visual-language models where no API support is available.
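
The idea of a "vision-language sentence" built from past screenshots and actions can be illustrated with a minimal sketch. The abstract does not specify the serialization, so everything here is an assumption made for illustration: the `Tap` and `Swipe` types, the normalized coordinate convention, the `<screenshot_i>` placeholder tokens (standing in for where image embeddings would go), and the `build_prompt` helper are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Tap:
    # Assumed convention: coordinates normalized to [0, 1].
    x: float
    y: float


@dataclass(frozen=True)
class Swipe:
    start: Tuple[float, float]
    end: Tuple[float, float]


Action = Union[Tap, Swipe]


def action_to_text(action: Action) -> str:
    """Serialize a UI gesture as text (hypothetical format)."""
    if isinstance(action, Tap):
        return f"tap({action.x:.2f}, {action.y:.2f})"
    (x0, y0), (x1, y1) = action.start, action.end
    return f"swipe({x0:.2f}, {y0:.2f} -> {x1:.2f}, {y1:.2f})"


def build_prompt(instruction: str, history: List[Action], max_steps: int = 4) -> str:
    """Interleave screenshot placeholders with past actions.

    The sequence ends on the current screenshot, for which the model
    is expected to predict the next action.
    """
    # Keep only the most recent steps to bound the context length.
    recent = history[-max_steps:] if max_steps > 0 else []
    parts = [f"instruction: {instruction}"]
    for i, action in enumerate(recent):
        parts.append(f"<screenshot_{i}>")  # placeholder for an image
        parts.append(action_to_text(action))
    parts.append(f"<screenshot_{len(recent)}>")  # current screen
    parts.append("next action:")
    return "\n".join(parts)


if __name__ == "__main__":
    past = [Tap(0.5, 0.25), Swipe((0.5, 0.8), (0.5, 0.2))]
    print(build_prompt("open settings", past))
```

Truncating the history to the last few steps is one simple way to bound the sequence length; how much history the actual model consumes is not stated in the abstract.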

Training a Vision Language Model as Smartphone Assistant

MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices

An Introduction to Vision-Language Modeling

ScreenAgent: A Vision Language Model-driven Computer Control Agent

MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

Intelligent Virtual Assistants with LLM-based Process Automation

Enabling Conversational Interaction with Mobile UI using Large Language Models

From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Intelligent Agents with LLM-based Process Automation

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

GPTVoiceTasker: LLM-Powered Virtual Assistant for Smartphone

ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations

VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models

VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Distilling Internet-Scale Vision-Language Models into Embodied Agents

Towards Next-Generation Intelligent Assistants Leveraging LLM Techniques

Yo'LLaVA: Your Personalized Language and Vision Assistant

User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance