Training a Vision Language Model as Smartphone Assistant

Nicolai Dorka, Janusz Marecki, Ammar Anwar
2024-04-13
Abstract: Addressing the challenge of a digital assistant capable of executing a wide array of user tasks, our research focuses on the realm of instruction-based mobile device control. We leverage recent advancements in large language models (LLMs) and present a visual language model (VLM) that can fulfill diverse tasks on mobile devices. Our model functions by interacting solely with the user interface (UI). It uses the visual input from the device screen and mimics human-like interactions, encompassing gestures such as tapping and swiping. This generality in the input and output space allows our agent to interact with any application on the device. Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots along with corresponding actions. Evaluating our method on the challenging Android in the Wild benchmark demonstrates its promising efficacy and potential.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction
### What problem does this paper attempt to solve?

This paper addresses how to build a digital assistant that can carry out a wide range of user tasks, with a focus on instruction-based mobile device control. Specifically, it uses visual-language models (VLMs) to control a mobile device, so that the model completes tasks solely by interacting with the user interface (UI).

#### Main challenges:

1. **Understanding natural language instructions**: Users normally operate a device through direct taps and gestures on the screen, but in many cases it is more natural and convenient to express a command in natural language.
2. **Generality across applications**: Many existing methods rely on API calls, but not all applications provide APIs, and integrating multiple APIs is difficult in terms of both training and context length.
3. **Processing visual information**: The model must understand what is shown on the screen and combine it with the natural language instruction to produce the correct sequence of operations.

#### Solutions:

- **Visual-language model (VLM)**: Building on recent progress in large language models (LLMs), the paper proposes a visual-language model that takes visual input (screenshots) and natural language instructions and imitates human interactions such as tapping and swiping.
- **Use of past screenshots**: Unlike previous methods, the model does not depend on a single screenshot. It also uses a sequence of past screenshots and their corresponding actions, which gives it more context for deciding on the next action.
- **Evaluation on Android in the Wild**: Evaluation on the challenging Android in the Wild benchmark dataset indicates that the method is effective and promising.
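To make the idea of combining past screenshots with their actions more concrete, the following is a minimal sketch in Python. It is an illustration only, not the authors' implementation: the names `Tap`, `Swipe`, `action_to_text`, and `build_sequence`, the `<image:...>` placeholder, the text format of actions, and the `max_steps` cutoff are all assumptions made for this example. A real system would replace the image placeholders with visual embeddings from the model's image encoder.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Tap:
    """Hypothetical tap action at a normalized screen position in [0, 1]."""
    x: float
    y: float


@dataclass(frozen=True)
class Swipe:
    """Hypothetical swipe action from a start point to an end point (normalized)."""
    x0: float
    y0: float
    x1: float
    y1: float


Action = Union[Tap, Swipe]


def action_to_text(action: Action) -> str:
    """Serialize an action as text so it can sit next to image placeholders."""
    if isinstance(action, Tap):
        return f"tap {action.x:.2f} {action.y:.2f}"
    return (
        f"swipe {action.x0:.2f} {action.y0:.2f} "
        f"{action.x1:.2f} {action.y1:.2f}"
    )


def build_sequence(
    instruction: str,
    history: List[Tuple[str, Action]],
    current_screen: str,
    max_steps: int = 3,
) -> List[str]:
    """Interleave the instruction, recent (screenshot, action) pairs,
    and the current screenshot into one ordered sequence.

    Screenshots are represented by string IDs here (an assumption);
    only the last `max_steps` history steps are kept.
    """
    recent = history[-max_steps:] if max_steps > 0 else []
    parts = [f"instruction: {instruction}"]
    for screen_id, action in recent:
        parts.append(f"<image:{screen_id}>")
        parts.append(action_to_text(action))
    # The model would be asked to predict the action for this final screen.
    parts.append(f"<image:{current_screen}>")
    return parts


history = [
    ("s0", Tap(0.5, 0.25)),
    ("s1", Swipe(0.5, 0.8, 0.5, 0.2)),
    ("s2", Tap(0.1, 0.9)),
]
sequence = build_sequence("open settings", history, "s3", max_steps=2)
print(sequence)
```

With `max_steps=2`, only the two most recent steps are kept, so the sequence holds the instruction, two screenshot/action pairs, and the current screenshot. Truncating the history is one simple way to bound context length; the source does not say how many past steps the model actually uses.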
### Summary:

The core problem of this paper is to develop an intelligent assistant that can understand natural language instructions and interact with mobile devices through the user interface, in particular by using visual-language models where API support is absent.