Abstract: Imagine a world where AI can handle your work while you sleep: organizing your research materials, drafting a report, or creating a presentation you need for tomorrow. However, while current digital agents can perform simple tasks, they are far from capable of handling the complex real-world work that humans routinely perform. We present PC Agent, an AI system that demonstrates a crucial step toward this vision through human cognition transfer. Our key insight is that the path from executing simple "tasks" to handling complex "work" lies in efficiently capturing and learning from human cognitive processes during computer use. To validate this hypothesis, we introduce three key innovations: (1) PC Tracker, a lightweight infrastructure that efficiently collects high-quality human-computer interaction trajectories with complete cognitive context; (2) a two-stage cognition completion pipeline that transforms raw interaction data into rich cognitive trajectories by completing action semantics and thought processes; and (3) a multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding. Our preliminary experiments in PowerPoint presentation creation reveal that complex digital work capabilities can be achieved with a small amount of high-quality cognitive data: PC Agent, trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications. This demonstrates the data efficiency of our approach, highlighting that the key to training capable digital agents lies in collecting human cognitive data. By open-sourcing our complete framework, including the data collection infrastructure and cognition completion methods, we aim to lower the barriers for the research community to develop truly capable digital agents.
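
To make the abstract's structure concrete, the following is a minimal sketch of how a two-stage completion pass and a planner/grounder loop could fit together. It is not the authors' implementation: the names `Step`, `complete_trajectory`, `run_episode`, and all fields and callables are hypothetical, and the models behind `describe_action`, `reconstruct_thought`, `plan`, and `ground` are left as plain functions supplied by the caller. The default cap of 50 steps only echoes the "up to 50 steps" figure quoted above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One step of an interaction trajectory (field names are hypothetical)."""
    raw_event: str                  # e.g. "click (412, 88)", as a tracker might log it
    action: Optional[str] = None    # filled in stage 1: what the action means
    thought: Optional[str] = None   # filled in stage 2: why the action was taken


def complete_trajectory(
    steps: List[Step],
    describe_action: Callable[[Step], str],
    reconstruct_thought: Callable[[Step, List[Step]], str],
) -> List[Step]:
    """Apply both completion stages to each raw step, leaving the input untouched."""
    done: List[Step] = []
    for step in steps:
        enriched = Step(step.raw_event, action=describe_action(step))  # stage 1
        enriched.thought = reconstruct_thought(enriched, done)         # stage 2 sees history
        done.append(enriched)
    return done


def run_episode(
    plan: Callable[[List[str]], Optional[str]],
    ground: Callable[[str], Tuple[int, int]],
    max_steps: int = 50,  # assumption: a simple cap, not a value from the framework
) -> List[Tuple[str, Tuple[int, int]]]:
    """Planner names the next target; grounder maps it to screen coordinates."""
    history: List[str] = []
    executed: List[Tuple[str, Tuple[int, int]]] = []
    for _ in range(max_steps):
        target = plan(history)
        if target is None:  # planner signals that the work is finished
            break
        executed.append((target, ground(target)))
        history.append(target)
    return executed
```

Separating the two roles this way means the planner never handles pixel coordinates and the grounder never needs task context, which is one plausible reading of the division of labor the abstract describes.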

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

You Only Look at Screens: Multimodal Chain-of-Action Agents

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

Aria-UI: Visual Grounding for GUI Instructions

PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

Ponder & Press: Advancing Visual GUI Agent towards General Computer Control

TinyClick: Single-Turn Agent for Empowering GUI Automation

ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

Agent S: An Open Agentic Framework that Uses Computers Like a Human

The user interface as an agent environment

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only

Falcon-UI: Understanding GUI Before Following User Instructions

UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Learning UI Navigation through Demonstrations composed of Macro Actions

From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating UI Operation Impacts

IBOTS: Agent control through the user interface

Object-centric proto-symbolic behavioural reasoning from pixels