How Higher Order Mutant Testing Performs for Deep Learning Models: A Fine-Grained Evaluation of Test Effectiveness and Efficiency Improved from Second-Order Mutant-Classification Tuples

Yanhui Li, Weijun Shen, Tengchao Wu, Lin Chen, Di Wu, Yuming Zhou, Baowen Xu
DOI: https://doi.org/10.1016/j.infsof.2022.106954
IF: 3.9
2022-01-01
Information and Software Technology
Abstract:
Context: Given the prevalence of Deep Learning (DL) models in daily life, it is crucial to ensure their reliability through DL testing. Recently, researchers have adapted mutation testing to DL testing to measure the test power of test sets. The bottleneck of DL mutation testing is the high cost of generating a large number of mutants.
Objective: We study whether the traditional ideas of “Higher Order” and “Strongly Subsuming” from Higher Order Mutant Testing still apply to DL mutation testing, i.e., whether they can be used to optimize DL mutation testing by reducing the number of mutants.
Method: We propose a new mutation testing framework that supports a fine-grained evaluation of test power, based on mutant-classification tuples, each of which pairs a mutant with a classification category. From these tuples, we construct First Order Tuples (FOTs) and Higher (Second) Order Tuples (HOTs) by applying mutation operators twice, and we search the HOTs for “Strongly Subsuming” HOTs (SSHOTs).
Results: Experiments on four widely used datasets and five DL model structures show that (1) a considerable number of SSHOTs can be found (from 720 to 25,840 across the five models), which greatly reduce the original set of FOTs (with reduction ratios from 28.69% to 91.97% in the studied DL models); and (2) the tuples reduced by SSHOTs perform very well in test case selection: for most studied DL models, the selected test set is almost as effective (i.e., it achieves almost the same mutation score) and much more efficient (i.e., its size is reduced by more than 50%).
Conclusions: Our study shows that “Higher Order” and “Strongly Subsuming” are useful for optimizing DL mutation testing, i.e., SSHOTs can be introduced to reduce the number of mutants and test cases.
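The “Strongly Subsuming” search described in the Method can be illustrated with a minimal sketch. It assumes the definition from traditional higher order mutation testing: a higher order mutant is strongly subsuming when the set of tests that kill it is non-empty and contained in the intersection of the kill sets of its constituent first order mutants. The abstract does not state how the paper adapts this definition to mutant-classification tuples, so the function name `is_strongly_subsuming` and the representation of kill sets as sets of test-case ids are hypothetical.

```python
from typing import Iterable, Set


def is_strongly_subsuming(hot_kills: Set[int],
                          fot_kill_sets: Iterable[Set[int]]) -> bool:
    """Check the traditional strongly-subsuming condition (an assumption here).

    hot_kills:      ids of the test cases that kill the higher order tuple.
    fot_kill_sets:  kill sets of the first order tuples it was built from.
    """
    fot_kill_sets = [set(s) for s in fot_kill_sets]
    # A tuple that no test kills, or that has no constituents, does not qualify.
    if not hot_kills or not fot_kill_sets:
        return False
    # Tests that kill every constituent first order tuple.
    common = set.intersection(*fot_kill_sets)
    # Every test that kills the higher order tuple must also kill all constituents.
    return set(hot_kills) <= common


# Illustrative kill sets (made-up test ids, not data from the paper).
fot_a = {1, 2, 3}
fot_b = {2, 3, 4}
print(is_strongly_subsuming({2, 3}, [fot_a, fot_b]))
print(is_strongly_subsuming({2, 5}, [fot_a, fot_b]))
```

Under this assumed definition, any test that kills the higher order tuple also kills both constituents, which is why such a tuple can stand in for them when reducing the set of FOTs.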