Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents

Junting Lu,Zhiyang Zhang,Fangkai Yang,Jue Zhang,Lu Wang,Chao Du,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Qi Zhang
2024-09-26
Abstract: Multimodal large language models (MLLMs) have enabled LLM-based agents to directly interact with application user interfaces (UIs), enhancing agents' performance in complex tasks. However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. This framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Office Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining an accuracy of 97%-98% compared to humans. Our work contributes to a new human-agent-computer interaction (HACI) framework and a fresh UI design principle for application providers in the era of LLMs. It also explores the possibility of turning every application into an agent, paving the way towards an agent-centric operating system (Agent OS).
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses three main problems that existing UI-based LLM agents face when completing tasks: high latency, low reliability, and poor generalization to unfamiliar UIs. Specifically:

1. **High Latency and Long Response Time**: Each individual UI interaction step requires one LLM call to decide which UI control to interact with, so long tasks accumulate substantial time and cost. In addition, the latency of an LLM call increases with the number of tokens processed. To obtain high-quality output, LLM-based UI agents must pass a large amount of UI information to describe the current state accurately, which further increases the latency of each call.
2. **Reliability Issues**: Research shows that LLMs are prone to hallucinations when generating responses. Over a long sequence of calls, the probability that an LLM-based UI agent selects the wrong UI control, or hallucinates a non-existent one, increases with each inference step. Because these agents usually pass the previous interaction history as additional context when inferring the current step, hallucinations in early steps also raise the probability of hallucinations in later steps. As a result, when a long UI interaction chain is required, UI agents are more likely to accumulate errors and fail the task.
3. **UI Generalization Challenges**: Although recent research has made progress in UI localization, handling application UIs that were not seen during LLM pre-training remains a major obstacle for LLM-based UI agents, and effective solutions are lacking.

To solve these problems, the paper proposes AXIS, a new API-based LLM agent framework that aims to improve task completion efficiency and reliability by giving priority to API calls over multi-step UI interactions.
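To see why shortening the interaction chain matters, the latency and reliability problems described above can be put into a small back-of-the-envelope model. The sketch below is not from the paper: the function names are hypothetical, the numbers (`P_STEP`, `T_CALL`) are illustrative assumptions rather than measurements, and steps are assumed to fail independently. Since the text notes that early hallucinations raise the chance of later ones, a real UI chain may fare worse than this simplified model suggests.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, ASSUMING independent steps."""
    return per_step_success ** steps


def chain_latency_seconds(per_call_latency: float, steps: int) -> float:
    """Total LLM latency when each step costs exactly one LLM call."""
    return per_call_latency * steps


# Illustrative values only (assumptions, not results reported in the paper).
P_STEP = 0.95  # assumed chance that a single LLM decision is correct
T_CALL = 3.0   # assumed seconds per LLM call

for label, steps in [("UI chain", 10), ("single API call", 1)]:
    p = chain_success_probability(P_STEP, steps)
    t = chain_latency_seconds(T_CALL, steps)
    print(f"{label}: steps={steps} success={p:.2f} latency={t:.0f}s")
```

Under these assumed values, a ten-step UI chain succeeds only about 60% of the time and takes ten times as long as a single call, which is the intuition behind replacing a chain of UI steps with one API action.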
The AXIS framework can automatically explore existing applications, learn from support documents and action trajectories, and construct new APIs to achieve low-latency, high-reliability task execution. Experiments show that AXIS significantly reduces task completion time and the user's cognitive workload while maintaining accuracy, and that it gives application providers a practical way to turn applications into agents, paving the way toward a true agent operating system (Agent OS).
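The API-first idea can be illustrated with a minimal dispatch sketch: use a registered API when one exists for the task, and fall back to a multi-step UI sequence only otherwise. This is not the AXIS implementation; every name here (`execute`, `insert_table_api`, `run_ui_steps`, the registry and plan dictionaries) is a hypothetical stand-in chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry: task name -> a single API-style callable.
ApiRegistry = Dict[str, Callable[[dict], str]]


def insert_table_api(args: dict) -> str:
    """Stand-in for an application API that finishes the task in one call."""
    return f"table {args['rows']}x{args['cols']} inserted via API"


def run_ui_steps(steps: List[str]) -> str:
    """Stand-in for a UI sequence; each step would cost one LLM call."""
    return f"completed via {len(steps)} UI steps"


def execute(task: str, args: dict, apis: ApiRegistry,
            ui_plans: Dict[str, List[str]]) -> str:
    """Prefer a registered API; fall back to UI steps when none exists."""
    api = apis.get(task)
    if api is not None:
        return api(args)
    return run_ui_steps(ui_plans[task])


apis: ApiRegistry = {"insert_table": insert_table_api}
ui_plans = {
    "insert_table": ["click Insert", "click Table", "pick grid size"],
    "change_theme": ["click Design", "click Themes", "pick theme"],
}

print(execute("insert_table", {"rows": 2, "cols": 3}, apis, ui_plans))
print(execute("change_theme", {}, apis, ui_plans))
```

In this sketch, expanding the registry (which the summary says AXIS does by exploring applications) moves more tasks from the UI fallback path onto the single-call path.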