Abstract: Backdoor defenses have recently become important in resisting backdoor attacks in deep neural networks (DNNs), where attackers implant backdoors into the DNN model by injecting backdoor samples into the training dataset. Although there are many defense methods to achieve backdoor detection for DNN inputs and backdoor elimination for DNN models, they still have not presented a clear explanation of the relationship between these two missions. In this paper, we use the features from the middle layer of the DNN model to analyze the difference between backdoor and benign samples and propose Backdoor Consistency, which indicates that at least one backdoor exists in the DNN model if the backdoor trigger is detected exactly on input. By analyzing the middle features, we design an effective and comprehensive backdoor defense method named BeniFul, which consists of two parts: a gray-box backdoor input detection and a white-box backdoor elimination. Specifically, we use the reconstruction distance from the Variational Auto-Encoder and model inference results to implement backdoor input detection and a feature distance loss to achieve backdoor elimination. Experimental results on CIFAR-10 and Tiny ImageNet against five state-of-the-art attacks demonstrate that our BeniFul exhibits a great defense capability in backdoor input detection and backdoor elimination.
What problem does this paper attempt to address?
This paper addresses the problem of defending deep neural networks (DNNs) against backdoor attacks. Specifically, it studies how to detect backdoor inputs and eliminate backdoors from a model by analyzing the features of the DNN's intermediate layers. It proposes a method named BeniFul, which combines gray-box backdoor input detection with white-box backdoor elimination to provide a comprehensive backdoor defense.
### Paper Background
With the wide application of deep learning across many fields, backdoor attacks on DNN models have become an important security issue. An attacker injects backdoor samples carrying a specific trigger into the training dataset, so that during training the DNN learns a strong association between the trigger and a target label. At inference time, the attacker can then manipulate the model's output by adding the trigger to an input, while the model behaves normally on inputs without the trigger.
### Shortcomings of Existing Methods
Many backdoor defenses already exist, including separating backdoor samples from the training set, training clean DNN models on poisoned data, detecting backdoor inputs, and eliminating backdoors from trained models. However, these methods have not clearly explained the relationship between backdoor detection and backdoor elimination. Existing methods also leave room for improvement in detection efficiency and accuracy.
### Paper Contributions
1. **Backdoor Consistency**: The paper proposes the concept of "Backdoor Consistency", that is, if a backdoor trigger is detected in the input, it can be inferred that at least one backdoor exists in the DNN model. This concept provides a theoretical basis for backdoor detection and elimination.
2. **BeniFul Method**:
- **Gray-box Backdoor Input Detection (BeniFul-BID)**: Uses a variational auto-encoder (VAE) to reconstruct intermediate features, and detects backdoor inputs by analyzing the VAE reconstruction distance together with the model inference results. Detection requires only a single model inference per input.
- **White-box Backdoor Elimination (BeniFul-BE)**: Defines a loss function that pushes the intermediate features of the repaired model away from the features of the original backdoored model, thereby removing the backdoor while maintaining the model's accuracy.
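
The two components above can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the class-prototype `reconstruct` function is a stand-in for a trained VAE, conditioning the reconstruction on the predicted label is only one hypothetical way to combine the reconstruction distance with the inference result, and the quantile threshold, the `weight` parameter, and all function names are assumptions made for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def calibrate_threshold(benign_distances, quantile=0.95):
    """Pick a cut-off so roughly `quantile` of benign distances fall at or below it."""
    ordered = sorted(benign_distances)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def is_backdoor_input(features, predicted_label, reconstruct, threshold):
    """Flag an input whose features are poorly reconstructed for its predicted class."""
    reconstruction = reconstruct(features, predicted_label)
    return l2_distance(features, reconstruction) > threshold

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def elimination_loss(logits, label, new_features, backdoor_features, weight=0.1):
    """Assumed form: task loss minus a weighted distance to the backdoored features."""
    task_term = cross_entropy(logits, label)
    distance_term = l2_distance(new_features, backdoor_features)
    return task_term - weight * distance_term

# Hypothetical stand-in for a trained VAE: reconstruct as the class prototype.
PROTOTYPES = {0: [0.0, 0.0], 1: [1.0, 1.0]}

def reconstruct(features, label):
    return PROTOTYPES[label]

# Calibrate on reconstruction distances measured on held-out benign inputs.
threshold = calibrate_threshold([0.05, 0.10, 0.15, 0.20])

# Features close to the prototype of the predicted class: not flagged.
benign_flag = is_backdoor_input([0.1, 0.0], 0, reconstruct, threshold)
# Features that look like class 0 but were predicted as class 1: flagged.
suspect_flag = is_backdoor_input([0.0, 0.0], 1, reconstruct, threshold)

# Moving features away from the backdoored model's features lowers the loss.
loss_same = elimination_loss([0.0, 0.0], 0, [0.0, 0.0], [0.0, 0.0])
loss_far = elimination_loss([0.0, 0.0], 0, [3.0, 4.0], [0.0, 0.0])
```

In this sketch the detector needs one forward pass to obtain both the features and the predicted label, which matches the single-inference property described above. The negative distance term is unbounded, so a practical loss of this shape would likely need clipping or a carefully chosen weight.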
### Experimental Results
The paper reports experiments on the CIFAR-10 and Tiny ImageNet datasets against five state-of-the-art backdoor attacks. The results show that BeniFul performs well in both backdoor input detection and backdoor elimination, with an average AUROC of 0.953 for detection, an average decrease in attack success rate (ASR) of 0.967, and an accuracy loss of only 0.028.
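
For readers unfamiliar with these metrics, the sketch below shows how they are conventionally computed. It uses the standard definitions (AUROC as the probability that a backdoor input receives a higher detection score than a benign one, ASR as the fraction of triggered inputs classified as the target label). The paper's exact evaluation protocol, for example whether samples whose true label is the target are excluded from ASR, is not given here, and the toy numbers are made up for illustration.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison: label 1 = backdoor input, 0 = benign input."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # backdoor input ranked above benign input
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

def attack_success_rate(predictions, target_label):
    """Fraction of triggered inputs that the model assigns to the attacker's target."""
    return sum(1 for p in predictions if p == target_label) / len(predictions)

def accuracy(predictions, true_labels):
    """Fraction of benign inputs classified correctly."""
    correct = sum(1 for p, y in zip(predictions, true_labels) if p == y)
    return correct / len(true_labels)

# Toy detection scores: one backdoor input (score 0.2) ranks below a benign one.
detection_auroc = auroc([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0])

# Toy predictions on triggered inputs before and after elimination (target label 7).
asr_before = attack_success_rate([7, 7, 7, 7], target_label=7)
asr_after = attack_success_rate([7, 1, 2, 3], target_label=7)
asr_decrease = asr_before - asr_after

# Toy predictions on benign inputs before and after elimination.
true_labels = [0, 1, 2, 3]
accuracy_loss = accuracy([0, 1, 2, 3], true_labels) - accuracy([0, 1, 2, 0], true_labels)
```

The pairwise AUROC is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable on full test sets.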
### Summary
This paper proposes a comprehensive backdoor defense method, BeniFul, by analyzing the features of the intermediate layers of DNNs. This method can not only efficiently detect backdoor inputs, but also effectively eliminate backdoors in the model, providing a new solution for the security of DNNs.