GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo

2024-06-13

Abstract: Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms. Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents were often trained with datasets comprising simple tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we introduce GUI Odyssey, a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos. Leveraging GUI Odyssey, we developed OdysseyAgent, a multimodal cross-app navigation agent, by fine-tuning the Qwen-VL model with a history resampling module. Extensive experiments demonstrate OdysseyAgent's superior accuracy compared to existing models. For instance, OdysseyAgent surpasses fine-tuned Qwen-VL and zero-shot GPT-4V by 1.44% and 55.49% in in-domain accuracy, and by 2.29% and 48.14% in out-of-domain accuracy on average. The dataset and code will be released at https://github.com/OpenGVLab/GUI-Odyssey.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses cross-application Graphical User Interface (GUI) navigation. Smartphone users frequently switch between multiple applications to complete a task, such as sharing content between social media platforms, and autonomous GUI navigation agents can improve the user experience by streamlining workflows and reducing manual intervention. However, existing GUI navigation agents are often trained only on simple tasks within a single application, so they perform poorly on cross-application navigation.

To address this issue, the paper introduces GUI Odyssey, a comprehensive dataset for training and evaluating cross-application navigation agents. The dataset consists of 7,735 navigation episodes from 6 mobile devices, covering 6 types of cross-application tasks, 201 applications, and about 1,400 application combinations. Using GUI Odyssey, the researchers developed OdysseyAgent, a multimodal cross-application navigation agent built by fine-tuning the Qwen-VL model with a history resampling module. Experimental results show that OdysseyAgent achieves higher accuracy than existing models on both in-domain and out-of-domain test sets. The paper also discusses the challenges of constructing such datasets, including task diversity and the difficulty of annotating cross-application consistency.

Overall, the paper aims to advance more accurate cross-application GUI navigation agents that can handle complex, real-world multi-application interaction scenarios, and it provides a rich dataset and a competitive agent model for this purpose.

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

GUICourse: From General Vision Language Models to Versatile GUI Agents

A Pairwise Dataset for GUI Conversion and Retrieval between Android Phones and Tablets

Pairwise GUI Dataset Construction Between Android Phones and Tablets

Falcon-UI: Understanding GUI Before Following User Instructions

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Appaction: Automatic GUI Interaction for Mobile Apps Via Holistic Widget Perception

E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion

CogAgent: A Visual Language Model for GUI Agents

You Only Look at Screens: Multimodal Chain-of-Action Agents

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

MobileViews: A Large-Scale Mobile GUI Dataset

MobileFlow: A Multimodal LLM For Mobile GUI Agent

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

AutoGLM: Autonomous Foundation Agents for GUIs

GUing: A Mobile GUI Search Engine using a Vision-Language Model

CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation