Abstract: Deep Neural Networks (DNNs) are increasingly deployed across applications. However, ensuring their reliability remains a challenge, and in many situations, alternative models with similar functionality and accuracy are available. Traditional accuracy-based evaluations often fail to capture behavioral differences between models, especially with limited test datasets, making it difficult to select or combine models effectively. Differential testing addresses this by generating test inputs that expose discrepancies in DNN model behavior. However, existing approaches face significant limitations: many rely on model internals or are constrained by available seed inputs. To address these challenges, we propose DiffGAN, a black-box test image generation approach for differential testing of DNN models. DiffGAN leverages a Generative Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II to generate diverse and valid triggering inputs that reveal behavioral discrepancies between models. DiffGAN employs two custom fitness functions, focusing on diversity and divergence, to guide the exploration of the GAN input space and identify discrepancies between models' outputs. By strategically searching this space, DiffGAN generates inputs with specific features that trigger differences in model behavior. DiffGAN is black-box, making it applicable in more situations. We evaluate DiffGAN on eight DNN model pairs trained on widely used image datasets. Our results show DiffGAN significantly outperforms a SOTA baseline, generating four times more triggering inputs, with greater diversity and validity, within the same budget. Additionally, the generated inputs improve the accuracy of a machine learning-based model selection mechanism, which selects the best-performing model based on input characteristics and can serve as a smart output voting mechanism when using alternative models.
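The abstract names two search objectives, divergence and diversity, but does not define them here. The sketch below is one plausible reading, not the authors' formulation: it assumes divergence is the total variation distance between the two models' class-probability outputs, and diversity is the distance from a candidate GAN latent vector to its nearest neighbor in an archive. All function names are hypothetical, and `pareto_front` computes only the first non-dominated front, whereas NSGA-II also ranks later fronts and uses crowding distance.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def divergence(probs_a: Vector, probs_b: Vector) -> float:
    """Total variation distance between two class-probability vectors (0 to 1).

    Assumption: the paper's divergence fitness may be defined differently.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(probs_a, probs_b))


def is_triggering(probs_a: Vector, probs_b: Vector) -> bool:
    """An input is treated as triggering when the models predict different classes."""
    top_a = max(range(len(probs_a)), key=probs_a.__getitem__)
    top_b = max(range(len(probs_b)), key=probs_b.__getitem__)
    return top_a != top_b


def diversity(candidate: Vector, archive: List[Vector]) -> float:
    """Euclidean distance from a candidate latent vector to its nearest archived one.

    Assumption: diversity is measured in the GAN input space.
    """
    if not archive:
        return math.inf
    return min(math.dist(candidate, other) for other in archive)


def pareto_front(scores: List[Tuple[float, float]]) -> List[int]:
    """Indices of candidates not dominated when both objectives are maximized."""
    front = []
    for i, (d_i, v_i) in enumerate(scores):
        dominated = any(
            d_j >= d_i and v_j >= v_i and (d_j > d_i or v_j > v_i)
            for j, (d_j, v_j) in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

Because only model outputs are consumed, a sketch like this stays black-box: nothing in it needs gradients, weights, or layer activations.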

What problem does this paper attempt to address?

The problem this paper addresses is that ensuring the reliability of deep neural network (DNN) models remains a challenge, especially when multiple alternative models with similar functionality and accuracy are available. Traditional accuracy-based evaluation often fails to capture the behavioral differences between these models, especially when the test dataset is limited, which makes it difficult to select or optimally combine models.

### Specific Problem Description

1. **Exposure of Behavioral Differences**:
   - Although multiple DNN models may show similar accuracy on a given test dataset, they can behave very differently under different operating conditions.
   - Traditional methods struggle to generate triggering inputs that reveal these behavioral differences, especially when the test dataset is limited.
2. **Challenges of Black-Box Models**:
   - Many existing differential testing methods rely on access to the model's internal structure, which is often infeasible in practice because many models are proprietary and their internals are not publicly available.
   - A black-box testing method that does not require access to model internals is therefore needed.
3. **Resource Constraints**:
   - The test budget is limited, and test cost can grow quickly, especially when expensive simulators or manual verification are required.
   - Valid triggering inputs must be generated efficiently within limited resources.
4. **Validity of Inputs**:
   - Generated triggering inputs must be valid and realistic. Invalid inputs may cause mispredictions that distort the evaluation of model performance.
   - A method is needed that ensures the generated inputs are not only diverse but also valid.

### Solution Overview

To solve the above problems, the authors propose DiffGAN, a black-box test generation method for differential testing. DiffGAN uses a Generative Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to generate diverse and valid triggering inputs, thereby revealing the behavioral differences between DNN models.

### Main Contributions

1. **Propose DiffGAN**: A new black-box test generation method that uses a GAN and NSGA-II to generate triggering inputs, suited to differential testing of DNN models with similar accuracy.
2. **Performance Evaluation**: Experiments on eight DNN model pairs show that DiffGAN significantly outperforms a state-of-the-art baseline, generating more triggering inputs with greater diversity and validity within the same budget.
3. **Application Expansion**: The triggering inputs generated by DiffGAN are used to train a machine learning model that dynamically selects the most accurate DNN for a given input based on its features, improving the accuracy of model selection.

Through these contributions, DiffGAN provides an effective and practical solution for differential testing of DNN models, especially when resources are constrained and model internals are inaccessible.
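The model selection mechanism can be pictured as a small classifier that maps input features to the name of the model expected to be correct. The summary does not say which learner or features the authors use, so the following is only a minimal sketch under assumptions: a 1-nearest-neighbor selector over hypothetical feature vectors, trained on triggering inputs labeled with the model that predicted them correctly. The class `NearestNeighborSelector` and the labels `"model_a"` and `"model_b"` are illustrative names, not part of DiffGAN.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


class NearestNeighborSelector:
    """Picks the model that was correct on the most similar training input.

    Assumption: a 1-NN rule stands in for whatever learner the paper trains.
    """

    def __init__(self) -> None:
        self._examples: List[Tuple[Vector, str]] = []

    def fit(self, features: List[Vector], best_models: List[str]) -> None:
        """Store (feature vector, name of the model that was correct) pairs."""
        if len(features) != len(best_models):
            raise ValueError("features and best_models must have equal length")
        self._examples = list(zip(features, best_models))

    def select(self, feature: Vector) -> str:
        """Return the model name attached to the nearest stored example."""
        if not self._examples:
            raise ValueError("selector has not been fitted")
        _, label = min(self._examples, key=lambda ex: math.dist(ex[0], feature))
        return label


# Usage: triggering inputs labeled by which model got them right.
selector = NearestNeighborSelector()
selector.fit(
    features=[(0.0, 0.0), (0.5, 1.0), (10.0, 10.0)],
    best_models=["model_a", "model_a", "model_b"],
)
chosen = selector.select((9.0, 8.0))
```

At inference time the chosen name decides whose prediction to trust when the two models disagree, which is the sense in which the abstract calls this an output voting mechanism.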

DiffGAN: A Test Generation Approach for Differential Testing of Deep Neural Networks
