Abstract: Quality assurance of deep neural networks (DNNs) is crucial for the deployment of DNN-based software, especially in mission- and safety-critical tasks. Inspired by structural white-box testing in traditional software, many test criteria have been proposed for DNNs. These criteria aim to expose erroneous behaviors by activating test units that have not yet been covered, such as new neurons, values, and decision paths. Many studies have evaluated the effectiveness of DNN test coverage criteria. However, existing empirical studies mainly measure how well the criteria improve the adversarial robustness of DNNs and ignore the correctness property. To fill this gap, we conduct a comprehensive study on 11 structural coverage criteria, 6 widely used image datasets, and 9 popular DNNs. We investigate the effectiveness of DNN coverage criteria over natural inputs from 4 aspects: (1) the correlation between test coverage and test diversity; (2) the effects of criteria parameters and target DNNs; (3) the effectiveness at prioritizing in-distribution natural inputs that lead to erroneous behaviors; (4) the capability to detect out-of-distribution (OOD) natural samples. Our findings include: (1) For measuring diversity, coverage criteria that consider the relationship between different neurons are more effective than criteria that treat each neuron independently. For instance, the neuron-path criteria (i.e., SNPC and ANPC) show high correlation with test diversity, which makes them promising for measuring test diversity in DNNs. (2) The hyper-parameters strongly influence the effectiveness of the criteria, especially those that control the granularity of the test criteria. Meanwhile, computational complexity is an important issue to consider when designing deep learning test coverage criteria, especially for large-scale models. (3) Test criteria related to data distribution (i.e., LSA, DSA, SNAC, and NBC) can be used to prioritize both in-distribution natural faults and out-of-distribution inputs. Furthermore, for OOD detection, the boundary metrics (i.e., SNAC and NBC) are also effective indicators, with lower computational costs and higher detection efficiency than LSA and DSA. These findings motivate follow-up research on scalable test coverage criteria that improve the correctness of DNNs.
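To make the notions of "covering new neurons" and "boundary metrics" concrete, the following is a minimal sketch, not the implementation used in the study. It assumes activations have already been collected into NumPy arrays of shape (number of inputs, number of neurons). It follows the common formulations of basic neuron coverage and of boundary-based coverage (SNAC and NBC), in which activation bounds are profiled on training data. The function names and the default threshold are hypothetical.

```python
import numpy as np

def neuron_coverage(acts, threshold=0.5):
    """Fraction of neurons whose activation exceeds `threshold`
    for at least one test input (threshold value is an assumption)."""
    covered = (acts > threshold).any(axis=0)
    return float(covered.mean())

def boundary_coverage(train_acts, test_acts):
    """Return (SNAC, NBC) for a test set.

    The per-neuron activation range [low, high] is profiled on training data.
    SNAC counts neurons pushed above `high` by some test input.
    NBC counts neurons pushed above `high` or below `low`, normalized by
    twice the number of neurons.
    """
    low = train_acts.min(axis=0)
    high = train_acts.max(axis=0)
    upper = (test_acts > high).any(axis=0)  # upper corner-case regions hit
    lower = (test_acts < low).any(axis=0)   # lower corner-case regions hit
    snac = float(upper.mean())
    nbc = float((upper.sum() + lower.sum()) / (2 * upper.size))
    return snac, nbc
```

Under these assumptions, a test input whose activations fall outside the profiled training range raises SNAC or NBC, which is one intuition for why such boundary metrics could flag out-of-distribution samples at low computational cost.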

DeepCNP: an Efficient White-Box Testing of Deep Neural Networks by Aligning Critical Neuron Paths

A White-Box Testing for Deep Neural Networks Based on Neuron Coverage

There is Limited Correlation Between Coverage and Robustness for Deep Neural Networks

DeepPath: Path-driven Testing Criteria for Deep Neural Networks

DeepHunter: a coverage-guided fuzz testing framework for deep neural networks

NPC: Neuron Path Coverage via Characterizing Decision Logic of Deep Neural Networks

Neuron Sensitivity Guided Test Case Selection for Deep Learning Testing

Test4Deep: an Effective White-Box Testing for Deep Neural Networks

DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks

An Uncovered Neurons Information-Based Fuzzing Method for DNN

TSDTest: A Efficient Coverage Guided Two-Stage Testing for Deep Learning Systems

Robust Black-box Testing of Deep Neural Networks using Co-Domain Coverage

Feature Map Testing for Deep Neural Networks

Excitement Surfeited Turns to Errors: Deep Learning Testing Framework Based on Excitable Neurons

DeepRTest: A Vulnerability-Guided Robustness Testing and Enhancement Framework for Deep Neural Networks

Can Coverage Criteria Guide Failure Discovery for Image Classifiers? an Empirical Study

Neuron Activation Frequency Based Test Case Prioritization

FuzzGAN: A Generation-Based Fuzzing Framework for Testing Deep Neural Networks

In Defense of Simple Techniques for Neural Network Test Case Selection