CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation

Xinbei Ma, Zhuosheng Zhang, Hai Zhao

2024-06-02

Abstract: Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments, especially for graphical user interface (GUI) automation. However, those GUI agents require comprehensive cognition ability including exhaustive perception and reliable action response. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve the GUI automation performance. First, CEP facilitates the GUI perception through different aspects and granularity, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes the action prediction into sub-problems: action type prediction and action target conditioned on the action type. With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios. Code is available at https://github.com/xbmxb/CoCo-Agent.
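The two-step decomposition that CAP describes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the action vocabulary, the `Action` type, and the toy rule-based predictors are hypothetical stand-ins for model calls and are not taken from the CoCo-Agent code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical action vocabulary; the benchmarks define their own action spaces.
ACTION_TYPES = ("click", "scroll", "type", "navigate")

Observation = Dict[str, object]


@dataclass
class Action:
    action_type: str
    target: str


def predict_action(
    observation: Observation,
    predict_type: Callable[[Observation], str],
    predict_target: Callable[[Observation, str], str],
) -> Action:
    """Predict the action type first, then the target given that type."""
    action_type = predict_type(observation)
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    # The second sub-problem is conditioned on the first prediction.
    target = predict_target(observation, action_type)
    return Action(action_type, target)


# Toy stand-ins for the model (assumptions, not the paper's predictors).
def toy_type_predictor(observation: Observation) -> str:
    goal = str(observation["goal"]).lower()
    return "type" if "search" in goal else "click"


def toy_target_predictor(observation: Observation, action_type: str) -> str:
    if action_type == "type":
        # For text input, the target is the text to enter.
        return str(observation["goal"])
    # Otherwise the target is a screen element; pick the first one here.
    layout: List[str] = observation["layout"]  # type: ignore[assignment]
    return layout[0]


if __name__ == "__main__":
    obs = {"goal": "search for weather", "layout": ["Search bar", "Settings"]}
    print(predict_action(obs, toy_type_predictor, toy_target_predictor))
```

Splitting the prediction this way means the target predictor only has to choose among targets that make sense for the already chosen type, which is the intuition the abstract gives for CAP.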

Computation and Language

What problem does this paper attempt to address?

The paper aims to address the issue of automating graphical user interfaces (GUIs) on smartphones. Specifically, it proposes a comprehensive cognitive multimodal language model agent named CoCo-Agent to improve the performance of GUI automation. The paper mainly tackles the following two core issues:

1. **Comprehensive Cognitive Ability**: Current GUI agents need to possess comprehensive cognitive abilities, including detailed perception and reliable action response. The paper proposes a new approach to enhance the agent's cognitive abilities through Comprehensive Environment Perception (CEP) and Conditional Action Prediction (CAP).
2. **Gap Between Existing Visual Modules and GUI Requirements**: Existing visual modules have shortcomings when dealing with GUIs, especially in handling fine-grained information and complex semantic associations. The paper introduces additional tools such as Optical Character Recognition (OCR) to provide detailed layout information to supplement high-level visual perception.

Through these technical means, CoCo-Agent achieves state-of-the-art performance on two representative benchmarks (AITW and META-GUI) and demonstrates its potential for application in real-world scenarios.
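As a rough illustration of how OCR-derived layout and action history might supplement a screenshot, the sketch below assembles the textual part of an agent input. The prompt format, the `LayoutItem` type, and the `build_text_prompt` helper are hypothetical and are not the template used by CoCo-Agent; the screenshot itself would be passed to the model's vision encoder separately.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayoutItem:
    """One text element found on screen, e.g. by an OCR tool (assumed shape)."""
    text: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def build_text_prompt(
    goal: str,
    layout: List[LayoutItem],
    history: List[str],
    max_history: int = 8,
) -> str:
    """Combine the goal, fine-grained layout, and recent actions into one prompt."""
    layout_lines = [
        f'{i}: "{item.text}" at {item.box}' for i, item in enumerate(layout)
    ]
    # Keep only the most recent actions; guard against max_history == 0,
    # because history[-0:] would return the whole list.
    recent = history[-max_history:] if max_history > 0 else []
    history_lines = [f"- {action}" for action in recent]

    parts = [
        f"Goal: {goal}",
        "Screen layout:",
        *(layout_lines or ["(no text detected)"]),
        "Previous actions:",
        *(history_lines or ["(none)"]),
        "Next action:",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    items = [
        LayoutItem("Search", (10, 20, 200, 60)),
        LayoutItem("Settings", (10, 80, 200, 120)),
    ]
    print(build_text_prompt("open settings", items, ["click Search"]))
```

Listing each element with its bounding box gives the language model the fine-grained text and position information that a high-level visual encoder may miss.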

Comprehensive Cognitive LLM Agent for Smartphone GUI Automation

You Only Look at Screens: Multimodal Chain-of-Action Agents

CogAgent: A Visual Language Model for GUI Agents

Android in the Zoo: Chain-of-Action-Thought for GUI Agents

AppAgent: Multimodal Agents as Smartphone Users

CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

Large Language Model-Brained GUI Agents: A Survey

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

MobA: A Two-Level Agent System for Efficient Mobile Task Automation

Building Cooperative Embodied Agents Modularly with Large Language Models

MobileAgent: enhancing mobile control via human-machine interaction and SOP integration

Chat with the Environment: Interactive Multimodal Perception Using Large Language Models

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

ProAgent: Building Proactive Cooperative Agents with Large Language Models

ScreenAgent: A Vision Language Model-driven Computer Control Agent

CGMI: Configurable General Multi-Agent Interaction Framework

META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI