Abstract: GUIs have long been central to human-computer interaction, providing an intuitive and visually driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. These models have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that changes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.

What problem does this paper attempt to address?

This paper addresses the limitations of traditional graphical user interface (GUI) automation methods when dealing with dynamic and complex tasks. Traditional automation relies mainly on script- or rule-driven approaches. These methods perform well in fixed workflows, but they lack flexibility and adaptability and struggle to cope with the diversity and change found in real-world interfaces. Large language models (LLMs), especially multimodal models, have shown remarkable capabilities in natural language understanding, code generation, task generalization, and visual processing, which makes new approaches to GUI automation possible. Specifically, the paper focuses on the following core issues:

1. **Existing GUI Agent Frameworks**: Survey the current mainstream GUI agent frameworks, their characteristics, and the scenarios they suit.
2. **Data Collection and Utilization**: Discuss how to collect and use data to train specialized GUI agents, especially the construction of datasets in multi-platform environments.
3. **Development of Large Action Models (LAMs)**: Examine how the collected data is used to train large action models for GUI tasks, and introduce the leading models in the field.
4. **Evaluation Metrics and Benchmarks**: Identify the key metrics and benchmarks needed to assess the effectiveness of GUI agents.
5. **Practical Applications**: Present practical applications of LLM-driven GUI agents, such as web navigation, mobile application interaction, and desktop automation.
6. **Future Research Directions**: Identify the key gaps in current research and outline future research directions and a development roadmap.

By systematically reviewing these aspects, the paper aims to provide a comprehensive guide that helps researchers and practitioners overcome challenges and make full use of the potential of LLM-driven GUI agents.

Large Language Model-Brained GUI Agents: A Survey
