Abstract: The prominent role of mobile apps in daily life has underscored the need for robust quality assurance, leading to the development of various automated Android Graphical User Interface (GUI) testing approaches. Code coverage and fault detection are the two primary metrics for evaluating the effectiveness of these approaches. However, conducting a reliable and robust evaluation based on these two metrics remains challenging because the current evaluation system is imperfect: metric granularities are numerous and entangled, and multiple sources of nondeterminism interfere with tests. For instance, evaluation based solely on the mean or total number of detected faults lacks statistical robustness. This has produced numerous conflicting conclusions that impede stakeholders' understanding of Android testing and hinder the advancement of Android testing methodologies. To mitigate these issues, this paper presents the first comprehensive statistical study of existing Android GUI testing metrics, involving extensive experiments with 8 state-of-the-art testing approaches on 42 diverse apps and examining statistical significance, correlation, and variation. Our study focuses on two primary areas: (1) the statistical significance and correlation between test metrics and among different metric granularities; and (2) the influence of test randomness and test convergence on the evaluation results of test metrics. By employing statistical analysis to account for the considerable influence of randomness, we arrive at four notable findings: (1) Instruction, Executable Lines Of Code (ELOC), and method coverage demonstrate notable consistency across both significance evaluation and mean value evaluation; likewise, evaluation on Fatal Errors versus Core Vitals, and on all errors versus well-selected errors, reveals a similarly high level of consistency. (2) There are evident inconsistencies between the code coverage and fault detection results, indicating that both metrics should be considered for a comprehensive evaluation. (3) Code coverage typically exhibits greater stability and robustness in evaluation than fault detection, which remains quite unstable even with the maximum number of test rounds used in previous studies. (4) A moderate test duration is sufficient for most approaches to demonstrate their overall effectiveness on most apps in both code coverage and fault detection, suggesting that a moderate test duration can be adopted to draw preliminary conclusions during Android testing development. These findings inform practical recommendations and support our proposal of an effective framework to enhance future mobile testing evaluations.

Navigating Mobile Testing Evaluation: A Comprehensive Statistical Analysis of Android GUI Testing Metrics
