Leveraging Large Vision Language Model For Better Automatic Web GUI Testing

Siyi Wang, Sinan Wang, Yujia Fan, Xiaolei Li, Yepang Liu
2024-10-16
Abstract: With the rapid development of web technology, more and more software applications have become web-based in the past decades. To ensure software quality and user experience, various techniques have been proposed to automatically test web applications by interacting with their GUIs. To achieve high functional coverage, web GUI testing tools often need to generate high-quality text inputs and interact with the associated GUI elements (e.g., click submit buttons). However, developing a holistic approach that solves both subtasks is challenging because the web GUI context can be complicated and highly dynamic, which makes it hard to process programmatically. The recent development of large vision-language models (LVLM) provides new opportunities to handle these longstanding problems. This paper proposes VETL, the first LVLM-driven end-to-end web testing technique. With LVLM's scene understanding capabilities, VETL can generate valid and meaningful text inputs focusing on the local context, while avoiding the need to extract precise textual attributes. The selection of associated GUI elements is formulated as a visual question-answering problem, allowing LVLM to capture the logical connection between the input box and the relevant element based on visual instructions. Further, the GUI exploration is guided by a multi-armed bandit module employing a curiosity-oriented strategy. Experiments show that VETL effectively explores web state/action spaces and detects bugs. Compared with WebExplor, the state-of-the-art web testing technique, VETL can discover 25% more unique web actions on benchmark websites. Moreover, it can expose functional bugs in top-ranking commercial websites, which the website maintainers have confirmed. Our work makes the first attempt at leveraging LVLM in end-to-end GUI testing, demonstrating promising results in this research direction.
Software Engineering
What problem does this paper attempt to address?
The paper addresses two coupled subtasks in automated web GUI testing, generating high-quality text inputs and selecting the GUI elements associated with them, with the goal of improving test coverage and bug detection. Specifically:

1. **Generating high-quality text inputs**: Existing web GUI testing tools struggle to generate valid and meaningful text inputs, especially for applications that require specific kinds of input (e.g., the departure and destination fields in flight booking). These tools typically generate random text, which limits their test coverage.
2. **Selecting relevant GUI elements**: After filling an input field, the tester must interact with the element that belongs to it (e.g., click the submit button). Traditional testing methods find it difficult to capture the logical relationship between input fields and their related elements.
3. **Handling complex web GUI contexts**: The web GUI context is often complicated and highly dynamic, which makes it hard to process programmatically. Existing methods often rely on extracting precise textual attributes, but in practice these attributes may be unclear or missing.

To address these issues, the paper proposes VETL (Vision-Enhanced Testing Loop), an end-to-end web testing technique based on large vision-language models (LVLM). VETL uses the scene understanding capabilities of LVLM to generate valid text inputs focused on the local context, and it formulates the selection of associated GUI elements as a visual question-answering problem. GUI exploration is guided by a multi-armed bandit module with a curiosity-oriented strategy. In the experiments, VETL discovers 25% more unique web actions on benchmark websites than WebExplor, the state-of-the-art web testing technique, and it exposes functional bugs in top-ranking commercial websites that the website maintainers have confirmed.
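The abstract states that exploration is guided by a multi-armed bandit module with a curiosity-oriented strategy, but it does not spell out the formulation. The following is a minimal sketch of one common way to build such a module, not the paper's implementation: it assumes a UCB1 selection rule and a count-based curiosity reward (rarely visited states pay more). The class name `CuriosityBandit`, its parameters, and the reward formula are all hypothetical.

```python
import math
from collections import defaultdict


class CuriosityBandit:
    """Illustrative count-based curiosity bandit over GUI actions.

    Assumption: UCB1 selection plus a 1/sqrt(visits) reward. The paper's
    actual bandit and reward definitions may differ.
    """

    def __init__(self, exploration=1.0):
        self.exploration = exploration          # weight of the UCB bonus
        self.pulls = defaultdict(int)           # times each action was executed
        self.value = defaultdict(float)         # running mean reward per action
        self.state_visits = defaultdict(int)    # times each state was reached
        self.total_pulls = 0

    def select(self, actions):
        # Execute any never-tried action first, in the given order.
        for action in actions:
            if self.pulls[action] == 0:
                return action

        # Otherwise pick the action with the highest UCB1 score.
        def ucb(action):
            bonus = math.sqrt(2 * math.log(self.total_pulls) / self.pulls[action])
            return self.value[action] + self.exploration * bonus

        return max(actions, key=ucb)

    def update(self, action, next_state):
        # Curiosity reward: 1.0 for a new state, decaying with repeat visits.
        self.state_visits[next_state] += 1
        reward = 1.0 / math.sqrt(self.state_visits[next_state])

        # Incremental update of the action's mean reward.
        self.pulls[action] += 1
        self.total_pulls += 1
        self.value[action] += (reward - self.value[action]) / self.pulls[action]
        return reward
```

In a testing loop, `select` would be called with the actions available on the current page and `update` with an identifier of the page state reached afterwards. How states and actions are abstracted into hashable identifiers is left open here, since the abstract does not describe it.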