Abstract: AI agents are changing the way work gets done, both in consumer and enterprise domains. However, the design patterns and architectures to build highly capable agents or multi-agent systems are still developing, and the understanding of the implications of various design choices and algorithms is still evolving. In this paper, we present our work on building a novel web agent, Agent-E (our code is available at https://github.com/EmergenceAI/Agent-E). Agent-E introduces numerous architectural improvements over prior state-of-the-art web agents, such as a hierarchical architecture, a flexible DOM distillation and denoising method, and the concept of change observation to guide the agent towards more accurate performance. We first present the results of an evaluation of Agent-E on the WebVoyager benchmark dataset and show that Agent-E beats other SOTA text and multi-modal web agents on this benchmark in most categories by 10-30%. We then synthesize our learnings from the development of Agent-E into general design principles for developing agentic systems. These include the use of domain-specific primitive skills, the importance of distillation and de-noising of environmental observations, the advantages of a hierarchical architecture, and the role of agentic self-improvement to enhance agent efficiency and efficacy as the agent gathers experience.

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Limitations of existing Web agents**: Current state-of-the-art Web agents still have many shortcomings in practical applications, especially in terms of efficiency and accuracy when performing complex tasks. These agents often exhibit high error rates and are not as reliable as humans in completing the same tasks.
2. **Design pattern and architecture improvements**: To build a more robust Web agent system, it is necessary to improve existing design patterns and architectures. Particularly when dealing with complex web structures (such as HTML DOM), effective simplification and noise reduction methods are needed to enhance agent performance.
3. **Multimodal processing capabilities**: Although existing research has demonstrated the potential of text and multimodal Web agents in performing diverse tasks on the internet, their practicality still needs improvement, especially in terms of task success rate, task completion time, and cost.

To address the above issues, the paper introduces Agent-E, a new-generation Web agent capable of performing complex Web tasks. The main contributions of Agent-E include:

- Proposing a novel layered architecture that separates task planning from browser navigation functions, enabling more complex task execution.
- Designing a flexible DOM extraction method that allows the browser navigation agent to choose the most suitable DOM representation based on task requirements.
- Introducing the concept of "change observation," which enhances the agent's understanding of the current environment and execution accuracy by monitoring state changes after each operation.
- Achieving significant performance improvements in the WebVoyager benchmark, not only increasing task success rates but also reporting other key metrics for the first time, such as error perception, task completion time, and LLM call counts, providing a benchmark for comprehensive evaluation of future Web agents.

Through these improvements, Agent-E demonstrates strong capabilities in performing Web tasks autonomously and provides valuable design principles for building more efficient AI agent systems.
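
To make the DOM distillation and "change observation" ideas above more concrete, here is a minimal Python sketch. It is not Agent-E's implementation: the names `InteractiveElementParser`, `distill`, and `observe_change`, the choice of which tags count as interactive, and the wording of the feedback string are all assumptions made for illustration. The sketch reduces a page to its interactive elements, then compares the reduced view before and after an action to produce a short textual observation that could be fed back to an agent.

```python
from html.parser import HTMLParser

# Tags treated as "interactive" in this sketch (an assumption, not Agent-E's filter).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
VOID_TAGS = {"input"}  # interactive tags that never contain inner text


class InteractiveElementParser(HTMLParser):
    """Collects a flat list of interactive elements and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # element that is currently collecting inner text

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_map = dict(attrs)
        element = {
            "tag": tag,
            "id": attr_map.get("id") or "",
            "text": attr_map.get("value") or "",
        }
        self.elements.append(element)
        self._open = None if tag in VOID_TAGS else element

    def handle_data(self, data):
        # Text outside an interactive element is treated as noise and dropped.
        if self._open is not None:
            self._open["text"] += data.strip()

    def handle_endtag(self, tag):
        if self._open is not None and self._open["tag"] == tag:
            self._open = None


def distill(html: str) -> list[tuple[str, str, str]]:
    """Reduce raw HTML to (tag, id, text) tuples for interactive elements only."""
    parser = InteractiveElementParser()
    parser.feed(html)
    parser.close()
    return [(e["tag"], e["id"], e["text"]) for e in parser.elements]


def observe_change(before_html: str, after_html: str) -> str:
    """Describe what changed in the distilled view after an action was taken."""
    before, after = distill(before_html), distill(after_html)
    added = [e for e in after if e not in before]
    removed = [e for e in before if e not in after]
    if not added and not removed:
        return "No change observed in interactive elements."
    parts = []
    if added:
        parts.append(f"appeared: {added}")
    if removed:
        parts.append(f"disappeared: {removed}")
    return "Change observed; " + "; ".join(parts)


if __name__ == "__main__":
    page_before = '<div><p>Welcome</p><button id="menu">Menu</button></div>'
    page_after = (
        '<div><button id="menu">Menu</button>'
        '<a id="settings" href="/s">Settings</a></div>'
    )
    print(observe_change(page_before, page_after))
```

In this toy example, clicking the menu button reveals a new link, so the observation reports that the link appeared. A real agent would take its snapshots from a live browser rather than from static strings, and would likely need a richer representation than three-field tuples.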

Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems

Waf: an Interface Web Agent Framework

Automated Design of Agentic Systems

A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Agents Are Not Enough

Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset

Design Patterns for Building More Efficient Generative Autonomous Agents: A Survey

WebArena: A Realistic Web Environment for Building Autonomous Agents

Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

Agency plus automation: Designing artificial intelligence into interactive systems

The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

Building AI Agents for Autonomous Clouds: Challenges and Design Principles

Infogent: An Agent-Based Framework for Web Information Aggregation

OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Agent AI: Surveying the Horizons of Multimodal Interaction

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

A Design Framework of Exploration, Segmentation, Navigation, and Instruction (ESNI) for the Lifecycle of Intelligent Mobile Agents as a Method for Mapping an Unknown Built Environment

AUTONOMOUS AGENTS AS EMBODIED AI

Building Intelligent Autonomous Navigation Agents