Abstract: Deep learning models for image classification have become standard tools in recent years. A well-known vulnerability of these models is their susceptibility to adversarial examples. These are generated by slightly altering an image of a certain class in a way that is imperceptible to humans but causes the model to classify it wrongly as another class. Many algorithms have been proposed to address this problem, falling generally into one of two categories: (i) building robust classifiers, and (ii) directly detecting attacked images. Despite the good performance of these detectors, we argue that in a white-box setting, where the attacker knows the configuration and weights of the network and the detector, the attacker can overcome the detector by running many examples on a local copy and sending only those that were not detected to the actual model. This problem is common in security applications where even a very good model is not sufficient to ensure safety. In this paper we propose to overcome this inherent limitation of any static defence with randomization. To do so, one must generate a very large family of detectors with consistent performance, and select one or more of them randomly for each input. For the individual detectors, we suggest the method of neural fingerprints. In the training phase, for each class we repeatedly sample a tiny random subset of neurons from certain layers of the network, and if their average is sufficiently different between clean and attacked images of the focal class, they are considered a fingerprint and added to the detector bank. At test time, we sample fingerprints from the bank associated with the label predicted by the model, and detect attacks using a likelihood ratio test. We evaluate our detectors on ImageNet with different attack methods and model architectures, and show near-perfect detection with low rates of false detection.
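
As a rough illustration of the training phase described in the abstract, the following Python sketch builds a fingerprint bank for one focal class. It is not the authors' implementation. The function names (`cohens_d`, `build_fingerprint_bank`), the subset size of 5, the effect-size threshold of 1.0, and the synthetic activations are all assumptions made for the example. Cohen's d is used as the measure of "sufficiently different" because the summary below names it.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two 1-D samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def build_fingerprint_bank(clean_acts, attacked_acts, n_candidates=2000,
                           subset_size=5, min_effect=1.0, seed=0):
    """Sample random neuron subsets and keep those that separate clean from attacked.

    clean_acts, attacked_acts: arrays of shape (n_images, n_neurons) holding
    activations for clean and attacked images of one focal class.
    subset_size and min_effect are illustrative values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_neurons = clean_acts.shape[1]
    bank = []
    for _ in range(n_candidates):
        idx = rng.choice(n_neurons, size=subset_size, replace=False)
        # The candidate fingerprint is the mean activation over the subset.
        clean_stat = clean_acts[:, idx].mean(axis=1)
        attacked_stat = attacked_acts[:, idx].mean(axis=1)
        d = cohens_d(attacked_stat, clean_stat)
        if abs(d) >= min_effect:
            bank.append({
                "neurons": idx,
                "effect": d,
                "clean_mean": clean_stat.mean(),
                "clean_std": clean_stat.std(ddof=1),
                "attacked_mean": attacked_stat.mean(),
                "attacked_std": attacked_stat.std(ddof=1),
            })
    return bank

# Synthetic demo (assumption): the attack shifts the first 64 of 512 neurons.
rng = np.random.default_rng(1)
n_images, n_neurons = 200, 512
clean = rng.normal(0.0, 1.0, (n_images, n_neurons))
attacked = rng.normal(0.0, 1.0, (n_images, n_neurons))
attacked[:, :64] += 1.5

bank = build_fingerprint_bank(clean, attacked)
print(f"kept {len(bank)} fingerprints out of 2000 candidates")
```

Because most random subsets miss the shifted neurons, only a fraction of candidates pass the threshold. Sampling many candidates is what makes the resulting bank large enough to randomize over.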

What problem does this paper attempt to address?

The paper addresses the fact that, in a fully white-box attack scenario, existing adversarial-example detectors are easy to bypass. When an attacker knows the complete system configuration and weights, they can run a large number of attack attempts locally, keep only the adversarial examples that evade the detector, and submit those to the real system. In this case even a detector with excellent performance cannot guarantee the security of the system. The paper therefore proposes a randomized detection method based on neural fingerprints. By generating a large family of detectors and randomly selecting one or more of them for each input at test time, it makes it hard for an attacker to find, in advance, inputs that evade whichever detectors end up being applied.

### Background and Method of the Paper

1. **Adversarial Attack**: An adversarial attack makes a machine-learning model misclassify by applying small but carefully designed perturbations to the input data. These attacks are particularly common in image classification and are typically generated with gradient-based methods.
2. **Existing Defense Methods**: Existing defenses fall into two main categories. The first builds a robust classifier, improving robustness by introducing adversarial examples during training. The second detects adversarial examples directly, using an additional detector to decide whether the input has been tampered with.
3. **Challenges of White-Box Attacks**: In the white-box scenario the attacker has complete information about the system and can find adversarial examples that bypass the detector through offline experiments. Static detectors therefore cannot provide an effective defense.

### Neural Fingerprint Method

1. **Definition of Neural Fingerprint**: A neural fingerprint is the average activation of a randomly selected group of neurons from a specific layer of the network. Adversarial attacks are detected by comparing how these fingerprints are distributed over clean samples and over adversarial samples.
2. **Generation Process**:
   - For each class, randomly select a small number of neurons from the last few layers of the network.
   - Compute the average activation of these neurons on clean samples and on adversarial samples.
   - Use Cohen's d effect size to measure how well the fingerprint separates clean samples from adversarial samples.
   - Keep the fingerprints with a sufficiently large effect and add them to the detector bank.
3. **Detection Process**:
   - At test time, randomly select one or more fingerprints from the detector bank.
   - Use a likelihood-ratio test, a voting method, or an anomaly-detection method to decide whether the input is an adversarial sample.

### Experimental Results

1. **Dataset**: The experiments were carried out on the ImageNet validation set, using two model architectures, Inception V3 and ViT.
2. **Attack Method**: Two attack methods, IFGSM and PGD, were used.
3. **Detection Performance**: The results show that the neural fingerprint method achieves a high detection rate across the tested models and attack methods. Under the likelihood-ratio detection rule in particular, the detection rate can reach 99.9% with a low false-positive rate.

### Main Contributions

1. **Points out, for the first time, the shortcomings of static detectors in the fully white-box attack scenario**.
2. **Proposes and verifies the randomized detection method based on neural fingerprints**, which improves the security of the system by generating a large family of detectors to sample from.

### Conclusion

By introducing neural fingerprints and a randomized detection strategy, the paper addresses the problem of adversarial-example detection in the fully white-box attack scenario and offers a new direction for improving the security of deep-learning systems.
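
The detection process in the summary above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: each fingerprint is modeled as Gaussian under both the clean and the attacked hypothesis, the sampled fingerprints are treated as independent so their log-likelihood ratios can be summed, and the decision threshold is 0. The names `detect_attack`, `gaussian_logpdf`, and `simulate_input`, the toy bank, and the synthetic activations are hypothetical.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def detect_attack(activations, bank, n_sampled, rng, threshold=0.0):
    """Likelihood-ratio test over fingerprints drawn at random from the bank.

    activations: 1-D array of neuron activations for a single input.
    bank: fingerprints for the predicted label, each with Gaussian parameters
          for the clean and attacked hypotheses (a modeling assumption).
    Returns (is_attack, log_likelihood_ratio).
    """
    chosen = rng.choice(len(bank), size=n_sampled, replace=False)
    log_lr = 0.0
    for i in chosen:
        fp = bank[i]
        stat = activations[fp["neurons"]].mean()
        log_lr += gaussian_logpdf(stat, fp["attacked_mean"], fp["attacked_std"])
        log_lr -= gaussian_logpdf(stat, fp["clean_mean"], fp["clean_std"])
    return bool(log_lr > threshold), log_lr

# Toy setup (assumption): the attack adds `shift` to the first 64 of 256 neurons,
# and every fingerprint averages 5 neurons drawn from that affected group.
rng = np.random.default_rng(0)
n_neurons, shifted, subset_size, shift = 256, 64, 5, 2.0
std = 1.0 / np.sqrt(subset_size)  # std of a mean of 5 unit-variance activations
bank = [{"neurons": rng.choice(shifted, size=subset_size, replace=False),
         "clean_mean": 0.0, "clean_std": std,
         "attacked_mean": shift, "attacked_std": std}
        for _ in range(50)]

def simulate_input(attacked):
    """Draw synthetic activations for one clean or attacked input."""
    x = rng.normal(0.0, 1.0, n_neurons)
    if attacked:
        x[:shifted] += shift
    return x

n_trials = 200
detection_rate = np.mean(
    [detect_attack(simulate_input(True), bank, 10, rng)[0] for _ in range(n_trials)])
false_positive_rate = np.mean(
    [detect_attack(simulate_input(False), bank, 10, rng)[0] for _ in range(n_trials)])
print(f"detection rate: {detection_rate:.3f}, false-positive rate: {false_positive_rate:.3f}")
```

Each call draws a fresh subset of fingerprints, so an attacker who evades one subset on a local copy has no guarantee of evading the subset drawn by the deployed system. The rates printed here describe only the synthetic setup and say nothing about the paper's reported results.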

Neural Fingerprints for Adversarial Attack Detection
