Abstract: Quality assurance of deep neural networks (DNNs) is crucial for the deployment of DNN-based software, especially in mission- and safety-critical tasks. Inspired by structural white-box testing in traditional software, many test criteria have been proposed to test DNNs, i.e., to expose erroneous behaviors by activating test units that have not yet been covered, such as new neurons, values, and decision paths. Many studies have evaluated the effectiveness of DNN test coverage criteria. However, existing empirical studies mainly measure how effectively DNN test criteria improve the adversarial robustness of DNNs, while ignoring the correctness property when testing DNNs. To fill this gap, we conduct a comprehensive study on 11 structural coverage criteria, 6 widely used image datasets, and 9 popular DNNs. We investigate the effectiveness of DNN coverage criteria over natural inputs from 4 aspects: (1) the correlation between test coverage and test diversity; (2) the effects of criteria parameters and target DNNs; (3) the effectiveness in prioritizing in-distribution natural inputs that lead to erroneous behaviors; (4) the capability to detect out-of-distribution (OOD) natural samples. Our findings include: (1) For measuring diversity, coverage criteria that consider the relationships between different neurons are more effective than criteria that treat each neuron independently. For instance, the neuron-path criteria (i.e., SNPC and ANPC) show high correlation with test diversity, making them promising measures of test diversity for DNNs. (2) Hyper-parameters strongly influence the effectiveness of criteria, especially those related to the granularity of test criteria. Meanwhile, computational complexity is an important issue to consider when designing deep learning test coverage criteria, especially for large-scale models. (3) Test criteria related to data distribution (i.e., LSA, DSA, SNAC, and NBC) can be used to prioritize both in-distribution natural faults and out-of-distribution inputs. Furthermore, for OOD detection, the boundary metrics (i.e., SNAC and NBC) are also effective indicators, with lower computational costs and higher detection efficiency than LSA and DSA. These findings motivate follow-up research on scalable test coverage criteria that improve the correctness of DNNs.
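
To make the boundary metrics named in the abstract concrete, here is a minimal sketch. It is not code from the study. It assumes that NBC and SNAC denote neuron boundary coverage and strong neuron activation coverage as they are commonly defined: a neuron counts as covered when a test input drives its activation outside the range observed on the training data. The function name `boundary_coverage` and the toy activation values are hypothetical, and the activations are assumed to be already extracted as rows of per-neuron values.

```python
def boundary_coverage(train_acts, test_acts):
    """Return (NBC, SNAC) for activations given as lists of per-input rows.

    train_acts, test_acts: lists of rows, one row per input,
    one value per neuron (assumed layout, for illustration only).
    """
    num_neurons = len(train_acts[0])
    # Per-neuron activation range observed on the training data.
    low = [min(row[j] for row in train_acts) for j in range(num_neurons)]
    high = [max(row[j] for row in train_acts) for j in range(num_neurons)]
    # A neuron's upper (lower) corner is covered if any test input
    # pushes its activation above (below) the training range.
    upper = sum(any(row[j] > high[j] for row in test_acts)
                for j in range(num_neurons))
    lower = sum(any(row[j] < low[j] for row in test_acts)
                for j in range(num_neurons))
    snac = upper / num_neurons
    nbc = (upper + lower) / (2 * num_neurons)
    return nbc, snac


# Toy example: 2 training inputs, 2 test inputs, 3 neurons.
train = [[0.0, 1.0, 2.0],
         [1.0, 3.0, 4.0]]
test = [[0.5, 3.5, 1.0],
        [2.0, 2.0, 3.0]]
nbc, snac = boundary_coverage(train, test)
print(f"NBC={nbc:.3f}, SNAC={snac:.3f}")
```

Both metrics need only one pass over stored per-neuron minima and maxima, which is in line with the abstract's remark that the boundary metrics have lower computational costs than LSA and DSA.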

Robust Black-box Testing of Deep Neural Networks using Co-Domain Coverage

An Empirical Study on Correlation between Coverage and Robustness for Deep Neural Networks

There is Limited Correlation Between Coverage and Robustness for Deep Neural Networks

CoCoFuzzing: Testing Neural Code Models With Coverage-Guided Fuzzing

A White-Box Testing for Deep Neural Networks Based on Neuron Coverage

DeepCNP: an Efficient White-Box Testing of Deep Neural Networks by Aligning Critical Neuron Paths

CAGFuzz: Coverage-Guided Adversarial Generative Fuzzing Testing of Deep Learning Systems

DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks

Coverage Guided Differential Adversarial Testing of Deep Learning Systems

HashC: Making DNNs' Coverage Testing Finer and Faster

DeepRTest: A Vulnerability-Guided Robustness Testing and Enhancement Framework for Deep Neural Networks

SoK: Certified Robustness for Deep Neural Networks

Can Coverage Criteria Guide Failure Discovery for Image Classifiers? an Empirical Study

HashC: Making deep learning coverage testing finer and faster

Coverage Testing of Deep Learning Models using Dataset Characterization

Increasing the Confidence of Deep Neural Networks by Coverage Analysis

DeepCov: Coverage Guided Deep Learning Framework Fuzzing

FDFuzz: Applying Feature Detection to Fuzz Deep Learning Systems

DLFuzz: Differential Fuzzing Testing of Deep Learning Systems

Robust Adversarial Attacks on Imperfect Deep Neural Networks in Fault Classification