Abstract: As our world digitizes, web agents that can automate complex and monotonous tasks are becoming essential in streamlining workflows. This paper introduces an approach to improving web agent performance through multi-modal validation and self-refinement. We present a comprehensive study of different modalities (text, vision) and the effect of hierarchy for the automatic validation of web agents, building upon the state-of-the-art Agent-E web automation framework. We also introduce a self-refinement mechanism for web automation, using the developed auto-validator, that enables web agents to detect and self-correct workflow failures. Our results show significant gains over the prior state-of-the-art performance of Agent-E (a SOTA web agent), boosting task-completion rates from 76.2\% to 81.24\% on a subset of the WebVoyager benchmark. The approach presented in this paper paves the way for more reliable digital assistants in complex, real-world scenarios.

What problem does this paper attempt to address?

The paper addresses how to improve the performance of web agents that automate complex and monotonous tasks by combining multimodal validation with a self-correction mechanism. The authors study how different modalities (text, vision) and hierarchical structure affect the automatic validation of web agents, and on that basis propose a self-correction mechanism that enables web agents to detect and correct errors in their own workflow.

### Main contributions:

1. **Multimodal validator**: Provides a comprehensive account of how different modalities can be used to build reliable, task-independent validators for web workflows. This includes using the hierarchical structure of the agent to extract information that is then fed to the validator.
2. **Self-correction mechanism**: Shows that self-correction can reach state-of-the-art web agent performance without additional human supervision, indicating that agents can improve web navigation through self-validation.
3. **Practical problems**: Points out practical problems that may be encountered when implementing web agents. These findings underline the complexity of the real-world web environment and offer insights for future research.

### Method overview:

- **Problem setting**: The web is modeled as a Markov decision process (MDP), defined as a tuple \((S, A, P, R, \gamma)\), where \(S\) represents web pages, \(A\) represents user actions (such as clicking on links), \(P\) comprises the mostly deterministic state transitions, \(R\) is the reward function, and the discount factor \(\gamma\) models the exploration depth.
- **Validation method**: A domain-independent web validator is constructed that can serve as a reward signal in any web navigation environment. The validator is studied with text, vision, multimodal, and hierarchical task-summary inputs.
- **Self-correction**: With the validator integrated, the web agent adjusts its strategy according to the validator's feedback and retries the task when it has not been completed.

### Experimental results:

- **Validator performance**: The text-log validator performs best, with an accuracy of 84.24%, followed closely by the multimodal validator at 83%. The visual validator performs relatively poorly, with an accuracy of 70.04%.
- **Self-correction effect**: On a subset of the WebVoyager benchmark, the self-correction mechanism increased the task completion rate from 76.2% to 81.24%.

### Conclusion:

The research develops an effective validator and integrates it into a self-correction mechanism, enabling web agents to detect and correct errors in their workflow without additional human supervision. The task completion rate improves from 76.2% to 81.24%, exceeding the previous best performance. Although the overall performance of the different modalities is comparable, their effectiveness differs significantly on specific tasks, indicating that the best validation modality is task-dependent and may require a specialized validator. The work also ran into technical challenges, especially with screenshot acquisition, which underline the complexity of the real-world web environment and provide insights for future research.
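
The retry behavior described under the method overview (run the agent, validate the result, feed the validator's feedback into the next attempt) can be illustrated with a minimal Python sketch. This is not the paper's or Agent-E's implementation: `Verdict`, `run_with_self_refinement`, `toy_agent`, `toy_validator`, and the `max_attempts` parameter are all hypothetical names introduced here, and the agent and validator are stand-in functions rather than LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical validator output: a pass/fail flag plus textual feedback."""
    success: bool
    feedback: str


# Hypothetical signatures (assumptions, not Agent-E's API):
# an agent maps (task, previous feedback) to a trajectory log,
# a validator maps (task, trajectory log) to a Verdict.
Agent = Callable[[str, Optional[str]], List[str]]
Validator = Callable[[str, List[str]], Verdict]


def run_with_self_refinement(
    task: str, agent: Agent, validator: Validator, max_attempts: int = 3
) -> Tuple[List[str], int, bool]:
    """Retry the task until the validator accepts it or attempts run out.

    Returns (last trajectory, attempts used, whether the validator accepted).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: Optional[str] = None
    trajectory: List[str] = []
    for attempt in range(1, max_attempts + 1):
        trajectory = agent(task, feedback)
        verdict = validator(task, trajectory)
        if verdict.success:
            return trajectory, attempt, True
        # Pass the validator's explanation to the next attempt.
        feedback = verdict.feedback
    return trajectory, max_attempts, False


def toy_agent(task: str, feedback: Optional[str]) -> List[str]:
    """Stand-in agent: forgets to submit unless it has received feedback."""
    steps = ["open search page", "type query"]
    if feedback is not None:
        steps.append("click submit")
    return steps


def toy_validator(task: str, trajectory: List[str]) -> Verdict:
    """Stand-in text-log validator: checks the log for a submit step."""
    if "click submit" in trajectory:
        return Verdict(True, "task completed")
    return Verdict(False, "the form was never submitted")


if __name__ == "__main__":
    steps, attempts, ok = run_with_self_refinement(
        "search for a product", toy_agent, toy_validator
    )
    print(ok, attempts, steps)
```

In this toy run the first attempt fails validation, the feedback reaches the agent, and the second attempt is accepted. In a real system the validator would receive text logs, screenshots, or both, which is where the modality comparison reported above comes in.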

Multimodal Auto Validation For Self-Refinement in Web Agents

Autonomous Evaluation and Refinement of Digital Agents

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations

Large Language Models Can Self-Improve At Web Agent Tasks

AutoAct: Automatic Agent Learning from Scratch for QA Via Self-Planning

Multi-agent system based autonomic computing environment

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

WebSuite: Systematically Evaluating Why Web Agents Fail

Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems

OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

You Only Look at Screens: Multimodal Chain-of-Action Agents

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning

WebCanvas: Benchmarking Web Agents in Online Environments