Abstract: The prominent role of mobile apps in daily life has underscored the need for robust quality assurance, leading to the development of various automated Android Graphical User Interface (GUI) testing approaches. Code coverage and fault detection are the two primary metrics for evaluating the effectiveness of these approaches. However, conducting a reliable and robust evaluation based on these two metrics remains challenging because the current evaluation system is imperfect: it tangles together numerous metric granularities and is subject to multiple sources of nondeterminism in tests. For instance, evaluation based solely on the mean or total number of detected faults lacks statistical robustness, resulting in numerous conflicting conclusions that confuse stakeholders in Android testing and hinder the advancement of Android testing methodologies. To mitigate these issues, this paper presents the first comprehensive statistical study of existing Android GUI testing metrics, involving extensive experiments with 8 state-of-the-art testing approaches on 42 diverse apps and examining statistical significance, correlation, and variation. Our study focuses on two primary areas: (1) the statistical significance of, and correlation between, test metrics and among different metric granularities; and (2) the influence of test randomness and test convergence on the evaluation results of test metrics. By employing statistical analysis to account for the considerable influence of randomness, we obtain several notable findings: (1) Instruction, Executable Lines Of Code (ELOC), and method coverage demonstrate notable consistency across both significance evaluation and mean value evaluation; likewise, evaluation on Fatal Errors versus Core Vitals, and on all errors versus well-selected errors, reveals a similarly high level of consistency.
(2) There are evident inconsistencies between the code coverage and fault detection results, indicating that both metrics should be considered for a comprehensive evaluation. (3) Code coverage typically exhibits greater stability and robustness in evaluation than fault detection, whereas fault detection remains quite unstable even with the maximum number of test rounds used in previous studies. (4) A moderate test duration is sufficient for most approaches to showcase their overall effectiveness on most apps in both code coverage and fault detection, indicating that a moderate test duration can be adopted to draw preliminary conclusions in Android testing development. These findings inform practical recommendations and support our proposal of an effective framework to enhance future mobile testing evaluations.

Reproducing Timing-dependent GUI Flaky Tests in Android Apps Via A Single Event Delay

Flaky Test Detection in Android Via Event Order Exploration

Concurrency-related Flaky Test Detection in Android Apps

Time-travel Testing of Android Apps

Replaying Harmful Data Races in Android Apps

Record and Replay for Android: Are We There Yet in Industrial Cases?

TimeMachine: Time-travel Testing of Android Apps

Effective Testing of Android Apps Using Extended

An Empirical Analysis of UI-Based Flaky Tests

A Fast Crash Reproduction Method for Android Applications Based On Widget Hierarchy Graphs

A Context-Aware Approach for Dynamic GUI Testing of Android Applications

Test Reuse Based on Adaptive Semantic Matching Across Android Mobile Applications

Testing Android Apps Via Guided Gesture Event Generation

Effectively Manifesting Concurrency Bugs in Android Apps

A reinforcement learning-based approach to testing GUI of mobile applications

Enhancing Test Reuse with GUI Events Deduplication and Adaptive Semantic Matching

Test code flakiness in mobile apps: The developer's perspective

Navigating Mobile Testing Evaluation: A Comprehensive Statistical Analysis of Android GUI Testing Metrics

ATOM: Automatic Maintenance of GUI Test Scripts for Evolving Mobile Applications

Facilitating Reusable and Scalable Automated Testing and Analysis for Android Apps