Abstract:

Objective: Convolutional neural networks (CNNs) have shown their potential in computer science, electronic information, mathematics, and finance. However, their security is challenged in many of these domains. In a backdoor attack on supervised learning, the attacker adds samples carrying a trigger to the training set and changes their labels to a target label, so that in the inference stage the trained model predicts any sample with the trigger as the target label. Backdoor attacks severely threaten the interests of model owners, especially in high-value areas such as financial security. A series of defense strategies have been proposed to protect neural network models from backdoor attacks. However, conventional defense methods often require prior knowledge of the backdoor attack method or of the neural network model, such as the type and size of the trigger, which is impractical and limits the application scenarios of these defenses. To resolve this problem, we develop a backdoor defense method for image classification tasks based on input modification, called the information purification network (IPN). Processing samples with the IPN eliminates the impact of the trigger added to them.

Method: Image samples contain a large amount of redundant information, so we divide the information in an image into two categories: 1) semantic information that is relevant to the classification task, and 2) non-semantic information that is irrelevant to the classification task. A backdoor attack forces the model to pay attention to the non-semantic information of the sample during training, so that samples with the trigger are predicted as the target label. To suppress the trigger, the IPN is designed as a CNN that encodes and decodes the input samples; it aims to keep the image semantics unchanged while minimizing the non-semantic information of the original samples.
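As a rough illustration of the poisoning procedure described in the Objective, the following pure-Python sketch stamps a trigger onto a fraction of the training images and relabels them with the target label. The function names (`stamp_trigger`, `poison_dataset`), the 2×2 corner patch, and the poisoning rate are assumptions made for illustration only; the abstract does not specify which triggers were used.

```python
import copy
import random


def stamp_trigger(image, patch_size=2, value=1.0):
    """Return a copy of `image` (a list of pixel rows) with a square
    patch written into its bottom-right corner. The patch shape and
    position are hypothetical choices, not taken from the paper."""
    stamped = copy.deepcopy(image)
    height, width = len(stamped), len(stamped[0])
    for row in range(height - patch_size, height):
        for col in range(width - patch_size, width):
            stamped[row][col] = value
    return stamped


def poison_dataset(samples, target_label, rate=0.1, seed=0):
    """Poison a fraction `rate` of `samples`, a list of (image, label).

    Chosen samples receive the trigger and the target label; all other
    samples are passed through untouched. Returns the new list and the
    sorted indices of the poisoned samples."""
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for index, (image, label) in enumerate(samples):
        if index in chosen:
            poisoned.append((stamp_trigger(image), target_label))
        else:
            poisoned.append((image, label))
    return poisoned, sorted(chosen)
```

A model trained on the returned list can learn to associate the corner patch, which carries no semantic content, with the target label.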
The inputs to the IPN are clean samples, and its outputs are the modified samples. For training, several clean classifiers are first trained with different structures and training hyperparameters. The IPN is then optimized to make the difference between the modified sample and the original sample as large as possible, on the premise that the modified sample is still correctly predicted by these classifiers. The loss function consists of two parts: 1) semantic information retention, and 2) non-semantic information suppression. The weights of the two parts can be balanced to control the difference between the modified sample and the original sample. Encoding and decoding a sample with the IPN disrupts the structure of the trigger, so the sample is not predicted as the target label even if the model has been injected with a backdoor. In addition, because the semantic information in the sample image is not weakened, samples containing the trigger are still predicted as their correct labels whether or not the model has been injected with a backdoor.

Result: All experiments are performed on an NVIDIA GeForce RTX 3090 graphics card. The execution environment is Python 3.8.5 with PyTorch 1.9.1. The datasets tested are CIFAR10, MNIST, and ImageNet10. The ImageNet10 dataset is constructed by randomly selecting 10 categories from the ImageNet dataset, for a total of 12 831 images. We randomly select 10 264 images as the training set and use the remaining 2 567 images as the test set. The architecture of the IPN is U-Net. To evaluate the defense performance of the proposed strategy in detail, a variety of triggers are used to implement backdoor attacks. For the MNIST dataset, the classification accuracy of the clean model on the original clean samples is 99%. We use two different triggers to implement backdoor attacks.
In each case, the average classification accuracy on clean samples is 99%, and the success rate of the backdoor attack is 100%. After all samples are encoded and decoded by the IPN, the classification accuracy on clean samples remains unchanged, the success rate of the backdoor attacks drops to 10%, and 98% of the backdoor samples are predicted as their correct labels. The experimental results on the other two datasets are similar to those on MNIST: while the classification accuracy on clean samples decreases slightly, the success rate of backdoor attacks is reduced to about 10%, and the backdoor samples are correctly predicted with high accuracy. It should be noted that the intensity and size of the triggers affect the defensive performance of the proposed strategy to a certain extent. The weighting between the two parts of the loss function affects the accuracy on clean samples: the weight of the non-semantic information suppression loss is positively correlated with the difference between the images and negatively correlated with the classification accuracy on clean samples.

Conclusion: The proposed strategy does not require any prior knowledge of the triggers or of the models to be protected. The classification accuracy on clean samples remains unchanged, the success rate of backdoor attacks is equivalent to random guessing, and backdoor samples are predicted as their correct labels by the classifiers, regardless of whether the classifiers have been injected with a backdoor. Training the IPN requires only clean training data and the task of the protected model. To deploy the defense, the IPN only needs to be placed in front of the protected model to preprocess the input samples. Multiple backdoor attacks are simulated on the three data sets mentioned above. Experimental results show that our defense strategy is effective across these heterogeneous attacks and data sets.
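The two-part training objective described in the Method can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact loss terms, so cross-entropy averaged over the clean classifiers stands in for semantic information retention, the negative mean squared difference between the modified and original image stands in for non-semantic information suppression, and `weight` is a hypothetical name for the balancing coefficient.

```python
import math


def cross_entropy(probabilities, label):
    """Negative log-likelihood of the true label under one classifier's
    predicted class probabilities (clamped to avoid log(0))."""
    return -math.log(max(probabilities[label], 1e-12))


def mean_squared_difference(modified, original):
    """Mean squared pixel difference between two flattened images."""
    total = sum((m - o) ** 2 for m, o in zip(modified, original))
    return total / len(original)


def ipn_loss(classifier_outputs, label, modified, original, weight=1.0):
    """Assumed form of the two-part objective for a single sample.

    classifier_outputs: one probability vector per clean classifier,
        each computed on the modified sample.
    Retention term: keep the modified sample correctly classified.
    Suppression term: reward a large difference from the original,
        expressed as a negative distance so that minimizing the loss
        maximizes the difference."""
    retention = sum(
        cross_entropy(p, label) for p in classifier_outputs
    ) / len(classifier_outputs)
    suppression = -mean_squared_difference(modified, original)
    return retention + weight * suppression
```

In this sketch a larger `weight` rewards the network more for changing the image, which is consistent with the reported trend that a heavier suppression weight increases the image difference and lowers the accuracy on clean samples.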
