Abstract: Backdoor defenses have recently become important in resisting backdoor attacks in deep neural networks (DNNs), where attackers implant backdoors into the DNN model by injecting backdoor samples into the training dataset. Although there are many defense methods to achieve backdoor detection for DNN inputs and backdoor elimination for DNN models, they still have not presented a clear explanation of the relationship between these two missions. In this paper, we use the features from the middle layer of the DNN model to analyze the difference between backdoor and benign samples and propose Backdoor Consistency, which indicates that at least one backdoor exists in the DNN model if the backdoor trigger is detected exactly on input. By analyzing the middle features, we design an effective and comprehensive backdoor defense method named BeniFul, which consists of two parts: a gray-box backdoor input detection and a white-box backdoor elimination. Specifically, we use the reconstruction distance from the Variational Auto-Encoder and model inference results to implement backdoor input detection and a feature distance loss to achieve backdoor elimination. Experimental results on CIFAR-10 and Tiny ImageNet against five state-of-the-art attacks demonstrate that our BeniFul exhibits a great defense capability in backdoor input detection and backdoor elimination.
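The two parts named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the exact formulas, so the following are all assumptions made for illustration: the Euclidean distance, the per-class threshold table `thresholds`, the idea that a larger reconstruction distance indicates a backdoor input, and the subtractive form of `elimination_loss`. The VAE and the DNN themselves are abstracted away; the functions take already-extracted feature vectors as plain lists.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_backdoor_input(feature, reconstruction, predicted_label, thresholds):
    """Flag an input whose middle-layer feature is reconstructed poorly.

    ASSUMPTION: one threshold per predicted class, calibrated on benign
    data. The abstract only states that the reconstruction distance and
    the inference result are both used; this combination is hypothetical.
    """
    distance = l2_distance(feature, reconstruction)
    return distance > thresholds[predicted_label]


def elimination_loss(task_loss, new_feature, old_feature, weight=1.0):
    """Hypothetical fine-tuning objective for backdoor elimination.

    Keeps the ordinary task loss low while rewarding a large distance
    between the repaired model's middle feature and the feature produced
    by the original backdoored model. `weight` is an assumed trade-off.
    """
    return task_loss - weight * l2_distance(new_feature, old_feature)


# Toy usage with hand-written vectors instead of real VAE outputs.
thresholds = {0: 1.0, 1: 1.0}
flag_far = is_backdoor_input([0.0, 0.0], [3.0, 4.0], 0, thresholds)
flag_near = is_backdoor_input([1.0, 1.0], [1.0, 1.5], 1, thresholds)
loss = elimination_loss(0.5, [0.0, 0.0], [3.0, 4.0], weight=0.1)
```

In this toy run the first input has a reconstruction distance of 5 and is flagged, while the second has a distance of 0.5 and is not. In practice the features would come from a chosen middle layer of the DNN and the reconstruction from a trained VAE.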

What problem does this paper attempt to address?

This paper addresses the defense of deep neural networks (DNNs) against backdoor attacks. Specifically, it studies how to detect and eliminate backdoors by analyzing the features of the intermediate layers of a DNN. It proposes a method named BeniFul, which combines gray-box backdoor input detection with white-box backdoor elimination to provide a comprehensive backdoor defense.

### Paper Background

In recent years, with the wide application of deep learning in various fields, backdoor attacks on DNN models have become an important security issue. Backdoor attackers inject samples carrying specific triggers into the training dataset, so that the DNN model learns a strong association between these triggers and the target labels during training. At inference time, an attacker can then manipulate the model's output by presenting the trigger, with potentially serious consequences.

### Shortcomings of Existing Methods

Many backdoor defense methods already exist, including separating backdoor samples from the training set, training clean DNN models, detecting backdoor inputs, and eliminating backdoors from models. However, these methods have not clearly explained the relationship between backdoor detection and backdoor elimination. In addition, existing methods still have room for improvement in detection efficiency and accuracy.

### Paper Contributions

1. **Backdoor Consistency**: The paper proposes the concept of "Backdoor Consistency": if a backdoor trigger is detected in the input, it can be inferred that at least one backdoor exists in the DNN model. This concept provides a theoretical basis for backdoor detection and elimination.
2. **BeniFul Method**:
   - **Gray-box Backdoor Input Detection (BeniFul-BID)**: Uses a variational auto-encoder (VAE) to reconstruct intermediate features, and detects backdoor inputs by analyzing the VAE reconstruction distance together with the model inference results. Detection requires only one model inference.
   - **White-box Backdoor Elimination (BeniFul-BE)**: Defines a loss function that pushes the intermediate features of the repaired model away from the features of the original backdoored model, thereby removing the backdoor while maintaining the accuracy of the model.

### Experimental Results

The paper reports experiments on the CIFAR-10 and Tiny ImageNet datasets against five state-of-the-art backdoor attack methods. The results show that BeniFul performs well in both backdoor input detection and backdoor elimination, with an average AUROC (area under the ROC curve) of 0.953, an average decrease in attack success rate (ASR) of 0.967, and a loss of model accuracy of only 0.028.

### Summary

This paper proposes BeniFul, a comprehensive backdoor defense method based on analyzing the features of the intermediate layers of DNNs. The method can both efficiently detect backdoor inputs and effectively eliminate backdoors from the model, providing a new solution for the security of DNNs.
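For readers unfamiliar with the metrics quoted in the experimental results, the following is a minimal sketch of how AUROC and the decrease in attack success rate (ASR) are commonly computed. The function names and the toy numbers are illustrative and are not taken from the paper. The sketch assumes that a higher detection score means "more likely a backdoor input" and that ASR is measured on triggered inputs only.

```python
def auroc(benign_scores, backdoor_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen backdoor input receives
    a higher score than a randomly chosen benign input (ties count 0.5).
    """
    wins = 0.0
    for b in backdoor_scores:
        for c in benign_scores:
            if b > c:
                wins += 1.0
            elif b == c:
                wins += 0.5
    return wins / (len(backdoor_scores) * len(benign_scores))


def attack_success_rate(predictions, target_label):
    """Fraction of triggered inputs classified as the attacker's target."""
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)


# Toy detection scores (illustrative values only).
benign = [0.1, 0.2, 0.3]
backdoor = [0.25, 0.9]
detection_auroc = auroc(benign, backdoor)

# Toy predictions on triggered inputs before and after elimination.
asr_before = attack_success_rate([7, 7, 7, 7], target_label=7)
asr_after = attack_success_rate([7, 1, 2, 3], target_label=7)
asr_decrease = asr_before - asr_after
```

The pairwise form is quadratic in the number of samples and is used here only because it makes the definition explicit; a rank-based implementation is preferable for large evaluation sets.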

BeniFul: Backdoor Defense via Middle Feature Analysis for Deep Neural Networks

BAN: Detecting Backdoors Activated by Adversarial Neuron Noise

Defending Against Backdoor Attacks by Layer-wise Feature Analysis

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Beating Backdoor Attack at Its Own Game

Backdoor Defense via Decoupling the Training Process

Need for Speed: Taming Backdoor Attacks with Speed and Precision

A lightweight backdoor defense framework based on image inpainting

Enhanced Coalescence Backdoor Attack Against DNN Based on Pixel Gradient

Can We Mitigate Backdoor Attack Using Adversarial Detection Methods?

DeepDefense: A Steganalysis-Based Backdoor Detecting and Mitigating Protocol in Deep Neural Networks for AI Security

Universal Post-Training Reverse-Engineering Defense Against Backdoors in Deep Neural Networks

Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack

NTD: Non-Transferability Enabled Deep Learning Backdoor Detection

Imperceptible and Multi-channel Backdoor Attack against Deep Neural Networks

Black-box Detection of Backdoor Attacks with Limited Information and Data

Composite Backdoor Attack for Deep Neural Network by Mixing Existing Benign Features

Backdoor Defense Via Deconfounded Representation Learning

Imperceptible Backdoor Attack: from Input Space to Feature Representation

A Novel Backdoor Attack Adapted to Transfer Learning

Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models