Navigating Mobile Testing Evaluation: A Comprehensive Statistical Analysis of Android GUI Testing Metrics
Yuanhong Lan, Yifei Lu, Minxue Pan, Xuandong Li
DOI: https://doi.org/10.1145/3691620.3695476
2024-01-01
Abstract: The prominent role of mobile apps in daily life has underscored the need for robust quality assurance, leading to the development of various automated Android Graphical User Interface (GUI) testing approaches. Code coverage and fault detection are the two primary metrics for evaluating the effectiveness of these approaches. However, conducting a reliable and robust evaluation based on these two metrics remains challenging because the current evaluation system is imperfect: it tangles together numerous metric granularities and is subject to multiple sources of nondeterminism in tests. For instance, an evaluation based solely on the mean or total number of detected faults lacks statistical robustness. This has produced numerous conflicting conclusions that impede stakeholders' understanding of Android testing and hinder the advancement of Android testing methodologies. To mitigate these issues, this paper presents the first comprehensive statistical study of existing Android GUI testing metrics, involving extensive experiments with 8 state-of-the-art testing approaches on 42 diverse apps and examining statistical significance, correlation, and variation. Our study focuses on two primary areas: (1) the statistical significance and correlation between test metrics and among different metric granularities, and (2) the influence of test randomness and test convergence on the evaluation results of test metrics. By employing statistical analysis to account for the considerable influence of randomness, we obtain four notable findings: (1) Instruction, Executable Lines Of Code (ELOC), and method coverage demonstrate notable consistency across both significance evaluation and mean value evaluation. Evaluations based on Fatal Errors versus Core Vitals, and on all errors versus well-selected errors, show a similarly high level of consistency.
(2) There are evident inconsistencies between the code coverage and fault detection results, indicating that both metrics should be considered for a comprehensive evaluation. (3) Code coverage typically exhibits greater stability and robustness in evaluation than fault detection, whereas fault detection remains quite unstable even with the largest number of test rounds used in prior studies. (4) A moderate test duration is sufficient for most approaches to showcase their overall effectiveness on most apps in both code coverage and fault detection, indicating that a moderate test duration can be adopted to draw preliminary conclusions in Android testing development. These findings inform practical recommendations and support our proposal of an effective framework to enhance future mobile testing evaluations.
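
As an illustration of why comparing mean fault counts alone can mislead, the following minimal sketch contrasts a mean comparison with a significance test. It is not the statistical procedure used in the study, which the abstract does not specify. The exact permutation test, the tool names `tool_x` and `tool_y`, and the per-round fault counts are all assumptions chosen for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Two-sided exact permutation test on the difference of means.

    Enumerates every way to split the pooled observations into groups of
    the original sizes and returns the fraction of splits whose absolute
    mean difference is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical faults detected per test round (illustrative data only).
tool_x = [3, 0, 7, 1, 4]
tool_y = [2, 1, 0, 6, 1]

print("mean faults:", mean(tool_x), "vs", mean(tool_y))
print("permutation p-value:", permutation_p_value(tool_x, tool_y))
```

In this made-up data the first tool has the higher mean, yet the round-to-round variation is large enough that the permutation test does not separate the two tools at the conventional 0.05 level. A ranking by mean alone would hide that uncertainty.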