Seeing is Believing: Vision-driven Non-crash Functional Bug Detection for Mobile Apps

Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang
2024-12-04
Abstract: Mobile app GUI (Graphical User Interface) pages now contain rich visual information, and the visual semantics of each page help users understand the application logic. However, this complex visual and functional logic presents new challenges to software testing. Existing automated GUI testing methods, constrained by the lack of reliable test oracles, are limited to detecting crash bugs with obvious abnormal signals. Consequently, many non-crash functional bugs, ranging from unexpected behaviors to logical errors, evade detection by current techniques. While these non-crash functional bugs can exhibit visual cues that serve as potential test oracles, they often span a sequence of screenshots, and detecting them requires an understanding of the operational logic among GUI page transitions, which is challenging for traditional techniques. Considering the remarkable performance of Multimodal Large Language Models (MLLMs) in visual and language understanding, this paper proposes Trident, a novel vision-driven, multi-agent collaborative automated GUI testing approach for detecting non-crash functional bugs. It comprises three agents, Explorer, Monitor, and Detector, which guide the exploration, oversee the testing progress, and spot issues, respectively. We also address several challenges, namely aligning visual and textual information for MLLM input, achieving functionality-oriented exploration, and inferring test oracles for non-crash bugs, to enhance the performance of functional bug detection. We evaluate Trident on 590 non-crash bugs and compare it with 12 baselines; it achieves boosts of 14%-112% and 108%-147% in average recall and precision, respectively, compared with the best baseline. The ablation study further confirms the contribution of each module. Moreover, Trident identifies 43 new bugs on Google Play, of which 31 have been fixed.
Software Engineering
What problem does this paper attempt to address?
The problem this paper addresses is how to effectively detect non-crash functional bugs in mobile applications. Existing automated GUI testing methods mainly detect crash bugs, because crashes produce obvious abnormal signals and are easy to identify. Many non-crash functional bugs (such as unexpected behaviors and logical errors) evade these methods because no reliable test oracle exists for them. Although these bugs do not crash the application, they degrade its functionality and user experience and may even lead to serious consequences.

### Main Problems and Challenges

1. **Combination of Vision and Functional Logic**:
   - The graphical user interface (GUI) pages of mobile applications contain rich visual information that helps users understand the logic of the application. This complex visual and functional logic brings new challenges to software testing.
   - Existing automated GUI testing methods lack reliable test oracles and can only detect crash bugs with obvious abnormal signals, so many non-crash functional bugs escape detection.
2. **Utilization of Visual Clues**:
   - Many non-crash functional bugs show clear visual clues, such as component occlusion or text overlap. Detecting them requires an understanding of the operational logic across GUI page transitions, which is a challenge for traditional techniques.
3. **Requirement for Automated Detection**:
   - Manual testing can identify these bugs, but it is time-consuming and costly and cannot be applied at scale.
   - Researchers have begun to use visual information to detect non-crash or functional bugs, but these methods are usually limited to specific bug types and rely on rule sets or large amounts of labeled data, so they adapt poorly to the diverse testing requirements of mobile applications.

### Solutions

To solve these problems, the authors propose Trident, a new vision-driven, multi-agent collaborative automated GUI testing method based on Multimodal Large Language Models (MLLMs). Trident consists of three agents:

- **Explorer Agent**: Navigates the application, captures the view hierarchy and screenshots, and guides the exploration of different GUI pages, focusing on the functionality of the application.
- **Monitor Agent**: Supervises the testing process, records the exploration history, and triggers the Detector Agent at the appropriate time.
- **Detector Agent**: Identifies potential functional bugs by examining the logical transitions in the GUI page changes.

The authors also address the following challenges:

1. **Alignment of Visual and Text Information**: Screenshot annotation and alignment methods help the MLLM better understand the GUI page content.
2. **Function-Oriented Exploration**: The current function is inferred and abstracted from the detailed exploration sequence, which avoids exceeding the token limit and strengthens the detection of functional bugs.
3. **Inference of Test Oracles**: Mechanisms determine when to trigger the Detector Agent, and a function-aware inference chain lets the MLLM explicitly infer test oracles and detect functional bugs based on those inferences.

With these techniques, Trident substantially outperforms the baselines at detecting non-crash functional bugs: the abstract reports gains of 14%-112% in average recall and 108%-147% in average precision over the best baseline.
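The division of labor among the three agents can be pictured with a minimal Python sketch. It is not the authors' implementation. Every name in it (`Step`, `Monitor`, `run_test`, and the `explore` and `detect` callables that stand in for MLLM calls) is hypothetical. The trigger rule is also an assumption: the summary only says the Monitor triggers the Detector "at the appropriate time", and this sketch guesses that the moment is when the explored function changes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One exploration step (hypothetical structure)."""
    action: str    # e.g. "tap Save"
    page: str      # stand-in for a screenshot plus view hierarchy
    function: str  # functionality the Explorer believes it is exercising


@dataclass
class Monitor:
    """Records history and decides when the Detector should run."""
    history: List[Step] = field(default_factory=list)
    checked: int = 0  # index of the first step not yet sent to the Detector

    def record(self, step: Step) -> List[Step]:
        """Store a step; return the finished segment if the function changed."""
        segment: List[Step] = []
        # Assumed trigger rule: a change of function closes the previous segment.
        if self.history and step.function != self.history[-1].function:
            segment = self.history[self.checked:]
            self.checked = len(self.history)
        self.history.append(step)
        return segment

    def flush(self) -> List[Step]:
        """Return whatever has not been checked yet (used at the end of a run)."""
        segment = self.history[self.checked:]
        self.checked = len(self.history)
        return segment

    def summary(self) -> List[str]:
        """Abstract the step-level history into an ordered list of functions,
        so a prompt built from it stays short."""
        names: List[str] = []
        for step in self.history:
            if not names or names[-1] != step.function:
                names.append(step.function)
        return names


# Both callables are placeholders for MLLM-backed agents.
Explorer = Callable[[List[str]], Optional[Step]]  # summary -> next step, or None to stop
Detector = Callable[[List[Step]], Optional[str]]  # segment -> bug report, or None


def run_test(explore: Explorer, detect: Detector, max_steps: int = 50) -> List[str]:
    """Drive the explore/monitor/detect loop and collect bug reports."""
    monitor = Monitor()
    reports: List[str] = []

    def check(segment: List[Step]) -> None:
        if segment:
            report = detect(segment)
            if report is not None:
                reports.append(report)

    for _ in range(max_steps):
        step = explore(monitor.summary())
        if step is None:
            break
        check(monitor.record(step))
    check(monitor.flush())  # the last segment never sees a function change
    return reports
```

Passing the Explorer only the function-level summary, and the Detector only the steps of one function, reflects the two stated design goals: keeping the input under the token limit and judging page transitions in the context of the function being exercised.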