AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li
2024-07-04
Abstract: AI agents have drawn increasing attention, mostly for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents. The dataset is used to train and evaluate such agents' ability to complete complex tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels. Unlike existing mobile device-control datasets such as MoTIF and AitW, AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions, each averaging 13 steps with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we develop a baseline model, SPHINX Agent, and compare its performance against state-of-the-art agents trained on other datasets. To facilitate further research, we open-source our dataset, models, and relevant evaluation tools. The project is available at https://yuxiangchai.github.io/AMEX/
Human-Computer Interaction, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
The paper aims to address the challenges faced by mobile Graphical User Interface (GUI) control agents when handling complex tasks, particularly the inadequacy in dealing with general third-party applications. Specifically, existing mobile GUI control datasets have the following limitations:

1. **Lack of functionality and diversity**: The annotations in existing datasets regarding functional descriptions, element labels, element functionalities, and action details are not rich and diverse enough.
2. **Low data quality**: Some datasets have inaccurate element annotations and poorly aligned bounding boxes.
3. **Issues with representativeness and scale**: Some datasets contain only a small number of instructions for general third-party applications and exhibit data redundancy.
4. **Limited practicality**: They rely on data representations like View Hierarchy, which are not universally available, thus limiting the practical application scope of the datasets.

To address these issues, the authors propose the Android Multi-annotation EXpo (AMEX), a comprehensive and large-scale dataset designed specifically for mobile GUI control agents. AMEX has the following features:

- **Multi-level annotations**: Including GUI interactive element localization, screen and element functional descriptions, and complex natural language instructions along with their corresponding action chains.
- **High quality and diversity**: AMEX ensures the accuracy of element bounding boxes through manual verification and uses GPT to generate screen and element functional descriptions, which are then manually checked to ensure quality.
- **Large scale and wide coverage**: It includes over 104,000 high-resolution screenshots from 110 popular applications, annotated with multiple levels of information.
- **Real-world applicability**: The instructions and operations in the dataset are more aligned with real-world tasks, with an average of 13 steps per instruction, making it more complex than existing datasets.

Additionally, the paper introduces the SPHINX Agent, a baseline model trained on the AMEX dataset. Experimental results show that the SPHINX Agent, trained with the AMEX dataset, performs better in handling complex tasks compared to models trained only on existing datasets, especially in tasks involving general third-party applications. These improvements highlight the significant contribution of the AMEX dataset in enhancing the capabilities of mobile GUI control agents.
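
To make the three annotation levels concrete, here is a minimal Python sketch of how one might model them in memory. It is not the AMEX file format: every class name, field name, the bounding-box layout, and the action strings are hypothetical and chosen only for illustration. The released dataset's actual schema should be taken from the project page.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Element:
    """Level 1: an interactive element grounded by a bounding box."""
    # (left, top, right, bottom) in pixels; this layout is an assumption.
    bbox: Tuple[int, int, int, int]
    # Level 2: optional functionality description of the element.
    functionality: Optional[str] = None


@dataclass
class Screenshot:
    image_path: str
    elements: List[Element] = field(default_factory=list)
    # Level 2: optional description of the whole screen.
    description: Optional[str] = None


@dataclass
class Step:
    """One entry in an action chain: a screen, an action, and its target."""
    screenshot: Screenshot
    action: str  # e.g. "tap" or "scroll"; hypothetical action vocabulary
    target: Optional[Element] = None


@dataclass
class Episode:
    """Level 3: a natural-language instruction with its stepwise actions."""
    instruction: str
    steps: List[Step] = field(default_factory=list)


def average_steps(episodes: List[Episode]) -> float:
    """Mean action-chain length across episodes (0.0 for an empty list)."""
    if not episodes:
        return 0.0
    return sum(len(e.steps) for e in episodes) / len(episodes)


# Toy example with made-up values.
search_button = Element((10, 20, 110, 60), "Opens the search page")
home = Screenshot("home.png", [search_button], "App home screen")
episode = Episode("Search for a hotel", [Step(home, "tap", search_button)])
print(average_steps([episode]))  # 1.0
```

Nesting the element and screen descriptions under the screenshot, and referencing screenshots from each step, keeps the three levels separable: a model could be trained on grounding alone, on descriptions alone, or on full instruction episodes. A statistic such as the reported average of 13 steps per instruction would correspond to `average_steps` over all episodes in such a representation.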