Abstract: Recently, Multimodal Large Language Models (MLLMs) have been used as agents that control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. It should also possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content unless they are given manually annotated keyframes or operation history. VideoLLMs, on the other hand, fall short in all GUI-oriented tasks because GUI video data is sparse. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limited performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The code and dataset are publicly available at our project homepage: https://gui-world.github.io/.
