Abstract: Multimodal large language models (MLLMs) have enabled LLM-based agents to directly interact with application user interfaces (UIs), enhancing agents' performance in complex tasks. However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. This framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Office Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compared to humans. Our work contributes to a new human-agent-computer interaction (HACI) framework and a fresh UI design principle for application providers in the era of LLMs. It also explores the possibility of turning every application into an agent, paving the way towards an agent-centric operating system (Agent OS).

What problem does this paper attempt to address?

The paper targets three weaknesses of existing UI-based LLM agents: high latency, low reliability, and poor generalization to unfamiliar UIs. Specifically:

1. **High Latency and Long Response Time**: Each UI interaction step requires one LLM call to decide which UI control to interact with, so multi-step tasks accumulate considerable time and cost. In addition, the latency of an LLM call is proportional to the number of tokens processed. To obtain high-quality output, LLM-based UI agents must pass a large amount of UI information that accurately describes the current state, which further increases the latency of each call.
2. **Reliability Issues**: Research shows that LLMs are prone to hallucinations when generating responses. Over a long sequence of calls, the probability that an LLM-based UI agent selects the wrong UI control, or hallucinates a non-existent one, increases with each inference step. Because these agents usually pass the previous interaction history as additional context when inferring the current step, hallucinations in early steps also raise the probability of hallucinations in later steps. When a long UI interaction chain is required, UI agents are therefore more likely to accumulate errors and fail the task.
3. **UI Generalization Challenges**: Although recent research has made progress in UI localization, handling application UIs that were not seen during LLM pre-training remains a major obstacle for LLM-based UI agents, and effective solutions are lacking.

To solve these problems, the paper proposes AXIS, a new API-based LLM agent framework that aims to improve task completion efficiency and reliability by prioritizing API calls over multi-step UI interactions. AXIS can automatically explore existing applications, learn from support documents and action trajectories, and construct new APIs to achieve low-latency, high-reliability task execution. Experiments show that AXIS can significantly improve the task completion rate, reduce the user's cognitive burden, and provide application providers with a practical method for transforming applications into agents, paving the way for a true agent operating system (Agent OS).
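The latency and reliability arguments above can be made concrete with a small back-of-the-envelope model. The sketch below is not from the paper: the `Plan` class, the function names, and every number in it (step counts, prompt sizes, per-step accuracy, seconds per 1,000 tokens) are hypothetical values chosen only for illustration. It assumes per-call latency grows linearly with prompt size and that steps fail independently, which is optimistic for UI agents, since the summary notes that early hallucinations make later ones more likely.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Hypothetical execution plan for one task (illustrative, not from the paper)."""
    name: str
    llm_calls: int        # one LLM call per UI step, or per API invocation
    tokens_per_call: int  # prompt size; UI agents must describe the on-screen state


def estimated_latency(plan: Plan, seconds_per_1k_tokens: float) -> float:
    """Total latency, assuming each call's latency is proportional to its tokens."""
    return plan.llm_calls * (plan.tokens_per_call / 1000) * seconds_per_1k_tokens


def success_probability(plan: Plan, per_step_accuracy: float) -> float:
    """Probability that every step is correct, assuming independent steps."""
    return per_step_accuracy ** plan.llm_calls


if __name__ == "__main__":
    # Assumed numbers: a 10-step UI chain versus a single API call.
    ui_plan = Plan("UI-only", llm_calls=10, tokens_per_call=4000)
    api_plan = Plan("API-first", llm_calls=1, tokens_per_call=1000)

    for plan in (ui_plan, api_plan):
        latency = estimated_latency(plan, seconds_per_1k_tokens=0.5)
        success = success_probability(plan, per_step_accuracy=0.95)
        print(f"{plan.name}: ~{latency:.1f}s, success ~ {success:.3f}")
```

Under these assumed numbers, the ten-step UI chain is both slower and less likely to finish without error than the single API call, which is the intuition behind preferring API actions. The real gap depends on the application, the model, and how much UI state each prompt carries.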

Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
