Abstract: Deep Neural Networks (DNNs) have been widely used in various domains, such as computer vision and software engineering. Although many DNNs have been deployed to assist with real-world tasks, they suffer, like traditional software, from defects that may lead to severe outcomes. DNN testing is one of the most widely used methods to ensure the quality of DNNs. It requires a rich set of test inputs with oracle information (the expected output) to reveal incorrect behaviors of a DNN model. However, manually labeling all collected test inputs is labor-intensive and delays the quality assurance process. Test selection tackles this problem by carefully selecting a small, more suspicious set of test inputs to label, so that failures of a DNN model can be detected with reduced effort. Researchers have proposed different test selection methods, including neuron-coverage-based and uncertainty-based methods, of which the uncertainty-based methods are arguably the most popular. Unfortunately, existing uncertainty-based selection methods hit a performance bottleneck due to one or more of the following limitations: 1) they ignore noisy data in real scenarios; 2) they wrongly exclude many failure-revealing test inputs while including many successful test inputs (i.e., test inputs that the model predicts correctly); 3) they ignore the diversity of the selected test set. In this paper, we propose RTS, a Robust Test Selection method for deep neural networks that overcomes these limitations. First, RTS divides all unlabeled candidate test inputs into a noise set, a successful set, and a suspicious set, and assigns a different selection priority to each set, which effectively alleviates the impact of noise and improves the ability to identify suspicious test inputs. Subsequently, RTS leverages a probability-tier-matrix-based test metric to prioritize the test inputs within each set (i.e., the suspicious, successful, and noise sets). As a result, RTS can select more suspicious test inputs within a limited selection size. We evaluate RTS by comparing it with 14 baseline methods on 5 widely used DNN models and 6 widely used datasets. The experimental results demonstrate that RTS significantly outperforms all baseline test selection methods in failure detection capability, and that the test suites selected by RTS have the best model optimization capability. For example, when selecting 2.5% of the test inputs, RTS achieves an improvement of 9.37% to 176.75% over the baseline methods in terms of failure detection.
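
The two-stage idea described above (partition the candidates, then rank within each partition) can be illustrated with a minimal sketch. This is not the RTS implementation: the abstract does not define the probability-tier-matrix metric or how noise is detected, so the sketch substitutes the Gini impurity of the softmax output as the uncertainty score, takes a caller-supplied `is_noise` flag per input, and assumes the priority order suspicious, then successful, then noise. The function names and the `confident_below` threshold are hypothetical.

```python
from typing import List, Sequence


def gini_score(probs: Sequence[float]) -> float:
    """Gini impurity of a softmax vector: 0 when fully confident, larger when uncertain."""
    return 1.0 - sum(p * p for p in probs)


def select_tests(
    probs: Sequence[Sequence[float]],
    is_noise: Sequence[bool],
    budget: int,
    confident_below: float = 0.2,  # hypothetical threshold, not from the paper
) -> List[int]:
    """Return the indices of up to `budget` test inputs to label.

    Assumptions (not from the paper): noise flags come from the caller,
    confident predictions are treated as likely successful, and the sets
    are drained in the order suspicious -> successful -> noise.
    """
    scores = [gini_score(p) for p in probs]

    suspicious: List[int] = []
    successful: List[int] = []
    noise: List[int] = []
    for i, score in enumerate(scores):
        if is_noise[i]:
            noise.append(i)
        elif score < confident_below:
            successful.append(i)
        else:
            suspicious.append(i)

    # Within each set, rank by descending uncertainty (stand-in metric).
    ranked: List[int] = []
    for group in (suspicious, successful, noise):
        ranked.extend(sorted(group, key=lambda i: scores[i], reverse=True))
    return ranked[:budget]


if __name__ == "__main__":
    example_probs = [
        [0.98, 0.01, 0.01],
        [0.40, 0.35, 0.25],
        [0.34, 0.33, 0.33],
        [0.90, 0.05, 0.05],
        [0.50, 0.50, 0.00],
    ]
    example_noise = [False, False, True, False, False]
    print(select_tests(example_probs, example_noise, budget=3))
```

Under these assumptions, an input flagged as noise is selected only after the other two sets are exhausted, even if its softmax output looks highly uncertain, which is the kind of behavior a purely uncertainty-ranked selection would not provide.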
