Abstract: Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the main challenges in building general-purpose Graphical User Interface (GUI) agents, particularly the poor performance of existing open-source Vision-Language Models (VLMs) in GUI grounding and Out-Of-Distribution (OOD) tasks. Specifically, it raises the following issues:

1. **Insufficient Data**: Existing VLMs are rarely pre-trained on GUI screenshots, and large-scale, multi-platform open-source GUI screenshot corpora are lacking. This limits the models' ability to generalize across platforms.
2. **Data Heterogeneity**: The content and format of existing datasets are inconsistent, and action names conflict. For example, the "tap" action on mobile devices and the "click" action on desktop platforms are logically equivalent but are given different names during annotation, causing confusion and performance degradation during model training.
3. **Performance Gap**: There is a significant performance gap between open-source VLMs and closed-source commercial models (such as GPT-4o and GeminiProVision) in GUI grounding and OOD tasks, making researchers reluctant to use open-source models.

### Solutions

To address the above issues, the paper proposes OS-Atlas, a foundational action model designed for general-purpose GUI agents. The main contributions include:

1. **Multi-Platform Data Synthesis Tool**: Developed and released the first cross-platform GUI grounding data synthesis tool, supporting the automatic synthesis of GUI grounding data for multiple platforms (Windows, macOS, Linux, Android, and Web), significantly reducing the data collection workload for future research.
2. **Large-Scale Multi-Platform GUI Grounding Corpus**: Using the above tool, compiled and released the largest multi-platform GUI grounding corpus to date, containing over 2.3 million distinct screenshots and more than 13 million GUI elements. The corpus also includes desktop grounding data not present in previous work. In addition, the authors re-annotated the popular ScreenSpot benchmark and released it as ScreenSpot-V2.
3. **Unified Action Space**: To resolve action naming conflicts during training, proposed a unified action space that includes basic actions and custom actions, ensuring the model's generality and consistency across different platforms and applications.
4. **Comprehensive Evaluation**: Conducted an extensive evaluation of OS-Atlas covering six benchmarks across three platforms: desktop, mobile, and Web. The results show that OS-Atlas significantly outperforms existing state-of-the-art models on multiple benchmarks, indicating its potential as an open-source alternative for developing future GUI agents.

### Summary

The paper addresses the performance issues of existing open-source VLMs in GUI grounding and OOD tasks by developing a multi-platform data synthesis tool, constructing a large-scale multi-platform GUI grounding corpus, proposing a unified action space, and conducting comprehensive evaluations. This lays a solid foundation for future research and development of general-purpose GUI agents.
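
The action-naming conflict described above (mobile "tap" versus desktop "click") can be illustrated with a small normalization step. The following is a minimal sketch, not the paper's implementation: the `Action` class, the `normalize_action` function, the alias table, and the canonical names `CLICK` and `TYPE` are all hypothetical, and only the tap/click equivalence is taken from the summary. Names outside the alias table are passed through as custom actions, mirroring the split between basic and custom actions.

```python
from dataclasses import dataclass, field

# Hypothetical alias table. Only the tap/click pair comes from the summary;
# the TYPE entries are illustrative assumptions.
BASIC_ALIASES = {
    "tap": "CLICK",
    "click": "CLICK",
    "input_text": "TYPE",
    "type": "TYPE",
}


@dataclass
class Action:
    """A platform-independent action record (hypothetical structure)."""
    name: str
    args: dict = field(default_factory=dict)
    is_custom: bool = False


def normalize_action(raw_name: str, **args) -> Action:
    """Map a dataset-specific action name onto one shared vocabulary."""
    key = raw_name.strip().lower()
    if key in BASIC_ALIASES:
        # Logically equivalent names collapse to a single basic action.
        return Action(BASIC_ALIASES[key], args)
    # Anything else is kept under its own name and flagged as custom.
    return Action(key.upper(), args, is_custom=True)


if __name__ == "__main__":
    mobile = normalize_action("tap", x=120, y=480)
    desktop = normalize_action("click", x=120, y=480)
    print(mobile == desktop)  # the two annotations now compare equal
    print(normalize_action("open_app", app="Settings"))
```

With a mapping like this applied before training, annotations from mobile and desktop datasets that describe the same operation share one label instead of competing with each other.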

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
