Leveraging Large Vision Language Model For Better Automatic Web GUI Testing

Siyi Wang, Sinan Wang, Yujia Fan, Xiaolei Li, Yepang Liu

2024-10-16

Abstract: With the rapid development of web technology, more and more software applications have become web-based over the past decades. To ensure software quality and user experience, various techniques have been proposed to automatically test web applications by interacting with their GUIs. To achieve high functional coverage, web GUI testing tools often need to generate high-quality text inputs and interact with the associated GUI elements (e.g., click submit buttons). However, developing a holistic approach that solves both subtasks is challenging because the web GUI context can be complicated and highly dynamic, which makes it hard to process programmatically. The recent development of large vision-language models (LVLMs) provides new opportunities to handle these longstanding problems. This paper proposes VETL, the first LVLM-driven end-to-end web testing technique. With the scene understanding capabilities of LVLMs, VETL can generate valid and meaningful text inputs focusing on the local context, while avoiding the need to extract precise textual attributes. The selection of associated GUI elements is formulated as a visual question-answering problem, allowing the LVLM to capture the logical connection between the input box and the relevant element based on visual instructions. Further, the GUI exploration is guided by a multi-armed bandit module employing a curiosity-oriented strategy. Experiments show that VETL effectively explores web state/action spaces and detects bugs. Compared with WebExplor, the state-of-the-art web testing technique, VETL can discover 25% more unique web actions on benchmark websites. Moreover, it can expose functional bugs in top-ranking commercial websites, which the website maintainers have confirmed. Our work makes the first attempt at leveraging LVLMs in end-to-end GUI testing, demonstrating promising results in this research direction.

Software Engineering

What problem does this paper attempt to address?

The paper addresses two coupled subtasks in automated web GUI testing, generating high-quality text inputs and selecting the GUI elements associated with them, with the aim of improving test coverage and bug detection. Specifically:

1. **Generating high-quality text inputs**: Existing web GUI testing tools struggle to generate valid and meaningful text inputs, especially for applications that require specific kinds of input (e.g., departure and destination in flight booking). These tools typically generate random text inputs, which limits their test coverage.
2. **Selecting relevant GUI elements**: Selecting the interactive element related to an input field (such as the submit button to click) is also a key task. Traditional testing methods find it difficult to capture the logical relationship between input fields and their related elements.
3. **Handling complex web GUI contexts**: The context of a web GUI is often complex and highly dynamic, which makes it hard to process programmatically. Existing methods often rely on extracting precise textual attributes, but in practice these attributes may be unclear or missing.

To address these issues, the paper proposes VETL (Vision-Enhanced Testing Loop), an end-to-end web testing technique based on large vision-language models (LVLMs). VETL uses the scene understanding capabilities of LVLMs to generate valid text inputs, and it selects the associated GUI elements by posing the choice as a visual question-answering task. According to the abstract, VETL discovers 25% more unique web actions than WebExplor on benchmark websites and exposes functional bugs in top-ranking commercial websites that the maintainers have confirmed.
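The abstract states that GUI exploration is guided by a multi-armed bandit module with a curiosity-oriented strategy, but gives no further detail. The following is a minimal sketch of that general idea, not VETL's actual algorithm. It assumes a UCB-style selection rule and a novelty reward of 1/sqrt(visits); the names `CuriosityBandit`, `explore`, and the toy `transitions` map are all hypothetical.

```python
import math
import random
from collections import defaultdict


class CuriosityBandit:
    """Toy bandit whose reward favours actions that reach rarely seen states.

    Assumption: UCB1-style scoring with a 1/sqrt(visits) novelty reward.
    This is an illustration, not the module described in the paper.
    """

    def __init__(self, exploration=1.0, seed=0):
        self.exploration = exploration
        self.pulls = defaultdict(int)    # times each action was chosen
        self.value = defaultdict(float)  # running mean reward per action
        self.total = 0                   # total number of updates
        self.rng = random.Random(seed)

    def select(self, actions):
        # Try every action once before comparing scores.
        untried = [a for a in actions if self.pulls[a] == 0]
        if untried:
            return self.rng.choice(untried)

        def ucb(a):
            bonus = math.sqrt(math.log(self.total) / self.pulls[a])
            return self.value[a] + self.exploration * bonus

        return max(actions, key=ucb)

    def update(self, action, state_visits):
        # Curiosity reward: 1.0 for a first visit, shrinking with repeats.
        reward = 1.0 / math.sqrt(state_visits)
        self.pulls[action] += 1
        self.total += 1
        self.value[action] += (reward - self.value[action]) / self.pulls[action]
        return reward


def explore(transitions, steps=30):
    """Run the bandit on a toy site where each action leads to one state."""
    bandit = CuriosityBandit()
    visits = defaultdict(int)
    for _ in range(steps):
        action = bandit.select(list(transitions))
        state = transitions[action]
        visits[state] += 1
        bandit.update(action, visits[state])
    return bandit, visits
```

In a real tester the `transitions` lookup would be replaced by executing the action in a browser and abstracting the resulting page into a state. The shrinking reward is what pushes the bandit away from actions that keep returning to already explored pages.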
