Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems

Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, Ravi Kokku
2024-07-18
Abstract: AI agents are changing the way work gets done, in both consumer and enterprise domains. However, the design patterns and architectures for building highly capable agents or multi-agent systems are still developing, and the understanding of the implications of various design choices and algorithms is still evolving. In this paper, we present our work on building a novel web agent, Agent-E (our code is available at https://github.com/EmergenceAI/Agent-E). Agent-E introduces numerous architectural improvements over prior state-of-the-art web agents, such as a hierarchical architecture, a flexible DOM distillation and denoising method, and the concept of "change observation" to guide the agent towards more accurate performance. We first present the results of an evaluation of Agent-E on the WebVoyager benchmark dataset and show that Agent-E beats other SOTA text and multi-modal web agents on this benchmark in most categories by 10-30%. We then synthesize our learnings from the development of Agent-E into general design principles for developing agentic systems. These include the use of domain-specific primitive skills, the importance of distillation and de-noising of environmental observations, the advantages of a hierarchical architecture, and the role of agentic self-improvement to enhance agent efficiency and efficacy as the agent gathers experience.
Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Limitations of existing web agents**: Current state-of-the-art web agents still have many shortcomings in practical applications, especially in efficiency and accuracy on complex tasks. These agents often exhibit high error rates and are less reliable than humans at completing the same tasks.
2. **Design pattern and architecture improvements**: Building a more robust web agent system requires improving existing design patterns and architectures. In particular, handling complex web structures (such as the HTML DOM) calls for effective simplification and noise reduction methods to enhance agent performance.
3. **Multimodal processing capabilities**: Although existing research has demonstrated the potential of text and multimodal web agents for performing diverse tasks on the internet, their practicality still needs improvement, especially in task success rate, task completion time, and cost.

To address these issues, the paper introduces Agent-E, a new-generation web agent capable of performing complex web tasks. The main contributions of Agent-E include:

- Proposing a novel layered architecture that separates task planning from browser navigation, enabling more complex task execution.
- Designing a flexible DOM extraction method that allows the browser navigation agent to choose the most suitable DOM representation for the task at hand.
- Introducing the concept of "change observation," which monitors state changes after each operation to improve the agent's understanding of the current environment and its execution accuracy.
- Achieving significant performance improvements on the WebVoyager benchmark, not only increasing task success rates but also reporting other key metrics for the first time, such as error perception, task completion time, and LLM call counts, providing a baseline for comprehensive evaluation of future web agents.

Through these improvements, Agent-E demonstrates strong capabilities in performing web tasks autonomously and provides valuable design principles for building more efficient AI agent systems.
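
The idea of "change observation" can be illustrated with a minimal sketch. This is not Agent-E's actual implementation: the `Snapshot` type, the `observe_change` function, and the element-id dictionary used to represent page state are all hypothetical simplifications chosen for illustration. The sketch compares the page state before and after an action and produces a short textual description that could be fed back to the agent.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical simplified page state: URL plus text keyed by element id."""
    url: str
    elements: dict[str, str]


def observe_change(before: Snapshot, after: Snapshot) -> str:
    """Describe in plain language what changed between two snapshots."""
    notes = []
    if before.url != after.url:
        notes.append(f"Navigated from {before.url} to {after.url}.")

    # Set operations on dict key views give added, removed, and shared ids.
    added = sorted(after.elements.keys() - before.elements.keys())
    removed = sorted(before.elements.keys() - after.elements.keys())
    changed = sorted(
        key
        for key in before.elements.keys() & after.elements.keys()
        if before.elements[key] != after.elements[key]
    )

    if added:
        notes.append("New elements appeared: " + ", ".join(added) + ".")
    if removed:
        notes.append("Elements disappeared: " + ", ".join(removed) + ".")
    if changed:
        notes.append("Elements changed: " + ", ".join(changed) + ".")

    if not notes:
        return "No observable change; the action may have had no effect."
    return " ".join(notes)


# Example: typing into a search box makes a suggestion list appear.
before = Snapshot("https://example.com/search", {"search_box": ""})
after = Snapshot(
    "https://example.com/search",
    {"search_box": "laptops", "suggestions": "laptops under 500"},
)
print(observe_change(before, after))
```

Returning the observation as text rather than as a raw diff is a deliberate choice in this sketch: an LLM-based agent can read the message directly and decide, for instance, to select from the newly appeared suggestions or to retry an action that had no effect.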