Multimodal Auto Validation For Self-Refinement in Web Agents

Ruhana Azam, Tamer Abuelsaad, Aditya Vempaty, Ashish Jagmohan
2024-10-11
Abstract: As our world digitizes, web agents that can automate complex and monotonous tasks are becoming essential in streamlining workflows. This paper introduces an approach to improving web agent performance through multi-modal validation and self-refinement. We present a comprehensive study of different modalities (text, vision) and the effect of hierarchy for the automatic validation of web agents, building upon the state-of-the-art Agent-E web automation framework. We also introduce a self-refinement mechanism for web automation, using the developed auto-validator, that enables web agents to detect and self-correct workflow failures. Our results show significant gains over the prior state-of-the-art performance of Agent-E (a SOTA web agent), boosting task-completion rates from 76.2% to 81.24% on a subset of the WebVoyager benchmark. The approach presented in this paper paves the way for more reliable digital assistants in complex, real-world scenarios.
Artificial Intelligence, Software Engineering
What problem does this paper attempt to address?
This paper addresses how to improve the performance of web agents that automate complex and monotonous tasks, using multimodal validation and self-correction. The authors study how different modalities (text, vision) and hierarchical structure affect the automatic validation of web agents and, on that basis, propose a self-correction mechanism that lets web agents detect and correct errors in their own workflows.

### Main contributions:

1. **Multimodal validator**: Provides a comprehensive account of how different modalities can be used to build reliable, task-independent validators for web workflows. This includes using the agent's hierarchical structure to extract information that is then passed to the validator.
2. **Self-correction mechanism**: Shows that, without additional human supervision, self-correction can reach the current best web agent performance. This demonstrates that agents can improve their web navigation through self-validation.
3. **Practical problems**: Points out practical problems that can arise when implementing web agents. These findings highlight the complexity of real-world web environments and offer useful insights for future research.

### Method overview:

- **Problem setting**: The web is modeled as a Markov decision process (MDP), defined as a tuple \((S, A, P, R, \gamma)\), where \(S\) represents web pages, \(A\) represents user actions (such as clicking a link), \(P\) captures the largely deterministic state transitions, \(R\) is the reward function, and \(\gamma\) models the exploration depth.
- **Validation method**: A domain-independent web validator is constructed that can serve as a reward signal in any web navigation environment. The validator draws on text, vision, multimodal input, and hierarchical task summaries to improve validation quality.
- **Self-correction**: With the validator integrated, the web agent adjusts its strategy according to the validator's feedback and retries the task when it has not been completed.

### Experimental results:

- **Validator performance**: The text-log validator performs best, with an accuracy of 84.24%, followed closely by the multimodal validator at 83%. The visual validator performs relatively poorly, with an accuracy of 70.04%.
- **Self-correction effect**: On a subset of the WebVoyager benchmark, the self-correction mechanism raised the task completion rate from 76.2% to 81.24%.

### Conclusion:

This research develops an effective validator and integrates it into a self-correction mechanism, enabling web agents to detect and correct workflow errors without additional human supervision. The task completion rate rises from 76.2% to 81.24%, exceeding the previous best performance. Although the overall performance of the different modalities is comparable, their effectiveness differs significantly on specific tasks, which indicates that the best validation modality is task-dependent and may require a specialized validator. The work also ran into technical challenges, most notably with screenshot acquisition, which again underlines the complexity of real-world web environments and offers useful insights for future research.
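
The self-correction loop summarized under the method overview (run the agent, validate the result, feed the validator's feedback into a retry) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's or Agent-E's implementation: the names `Verdict`, `run_with_self_refinement`, `toy_agent`, and `toy_validator`, the callable signatures, and the `max_attempts` cap are all hypothetical, and the toy agent and validator stand in for a real web agent and an LLM-based validator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Validator output: whether the task looks complete, plus feedback text."""
    completed: bool
    feedback: str


# Hypothetical interfaces (assumptions, not Agent-E's API):
#   agent(task, feedback_so_far) -> execution log as text
#   validator(task, execution_log) -> Verdict
Agent = Callable[[str, List[str]], str]
Validator = Callable[[str, str], Verdict]


def run_with_self_refinement(
    task: str,
    agent: Agent,
    validator: Validator,
    max_attempts: int = 3,
) -> Tuple[str, int, bool]:
    """Run the agent, validate, and retry with feedback until success or the cap.

    Returns (last execution log, attempts used, whether validation passed).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    feedback: List[str] = []
    log = ""
    for attempt in range(1, max_attempts + 1):
        log = agent(task, feedback)
        verdict = validator(task, log)
        if verdict.completed:
            return log, attempt, True
        # Keep the validator's explanation so the next attempt can adapt.
        feedback.append(verdict.feedback)
    return log, max_attempts, False


def toy_agent(task: str, feedback: List[str]) -> str:
    """Stand-in agent: only finishes the task once it has received feedback."""
    if feedback:
        return f"{task}: applied filter, DONE"
    return f"{task}: opened page"


def toy_validator(task: str, log: str) -> Verdict:
    """Stand-in text-log validator: passes only logs that end with DONE."""
    if log.endswith("DONE"):
        return Verdict(completed=True, feedback="")
    return Verdict(completed=False, feedback="Log does not show completion; retry.")


if __name__ == "__main__":
    print(run_with_self_refinement("find flights", toy_agent, toy_validator))
```

Capping the number of attempts is a design choice made for this sketch: the validator is itself imperfect (the best reported accuracy is 84.24%), so an uncapped loop could keep retrying a task that was in fact completed but judged as failed.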