RealWeb: A Benchmark for Universal Instruction Following in Realistic Web Services Navigation
Bolin Zhang, Shiyun Xiong, Dianbo Sui, Yunzhe Xu, Zhiying Tu, Dianhui Chu
DOI: https://doi.org/10.1109/icws62655.2024.00056
2024-01-01
Abstract: Traditional methods of interacting with web pages, such as clicking and scrolling, greatly hinder users, especially people with disabilities and the elderly, from conveniently accessing web services. By following user instructions, automatic web service navigation agents accomplish complex tasks on websites, offering a natural way to interact with web services. To study this task, previous works constructed simple web pages within simulated environments, but real websites are far more intricate. Moreover, existing methods for this task rely on manually collecting human demonstrations on the given websites, which is time-consuming and labor-intensive and reduces the generalization ability of service agents to unseen websites. Thus, we construct the first Chinese multimodal benchmark for web services navigation under realistic settings: across domains and without human demonstrations. Our benchmark comprises a dataset (RealWeb) and a baseline method (WeServe). RealWeb consists of 40 real-world websites, 110 pages, and 11,739 language instructions. To detect and understand the feasible operations of pages in the visual mode, the screenshot of each page in RealWeb is annotated with 5 critical areas. WeServe is a multimodal framework for web services navigation that combines visual and textual information, enabling universal navigation on any web page with a success rate of 68.61%. Our dataset and code are available at https://gitee.com/plabrolin/real-web.