Abstract: We present a novel defense against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract the backdoor functionality of a given backdoored model into a backdoor expert model. The approach is straightforward: finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 17 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).

What problem does this paper attempt to address?

This paper proposes a new defense against backdoor attacks on deep neural networks (DNNs). Specifically, it focuses on detecting and defending against backdoor attacks after the model has been developed, without needing to know how the model was generated. Most existing defense strategies focus on the development stage, that is, they take measures during training to prevent a backdoor from being implanted. These methods are often difficult to apply once the model has been deployed, because they require access to the training dataset or to dynamic information from the training process, such as gradient updates or loss values.

The proposed solution is called **BaDExpert** (Backdoor Input Detection with Backdoor Expert). Its core idea is to extract the backdoor functionality from a backdoored model and use it to identify and filter out backdoor inputs. The specific steps are as follows:

1. **Backdoor function extraction**:
   - Fine-tune the backdoored model on a small number of deliberately mislabeled clean samples, so that it forgets the normal classification function but retains the backdoor function. The resulting model is called the "Backdoor Expert Model".
   - Expressed as a formula:
     \[ \text{Backdoor Expert Model} \quad B \leftarrow \text{finetune}(M, D_c, \eta, m) \]
     where \( M \) is the backdoored model, \( D_c \) is a small set of deliberately mislabeled clean samples, \( \eta \) is the learning rate, and \( m \) is the number of iterations.
2. **Backdoor input detection**:
   - Use the backdoor expert model \( B \) together with an auxiliary model \( M' \), obtained by standard fine-tuning, to detect whether an input is a backdoor input.
   - The detection rule is based on whether the models' predictions agree. If the predictions of \( B \) and \( M \) agree, the input is considered a backdoor input; if they disagree, the input is considered a normal input.
   - The specific decision rule can be expressed as:
     \[ \text{Reject input } x \quad \text{if} \quad \frac{\text{Conf}_{M'}(\tilde{y}|x)}{\text{Conf}_B(\tilde{y}|x)} \leq \alpha \]
     where \( \tilde{y} = M(x) \), \( \text{Conf}_B(\tilde{y}|x) \) and \( \text{Conf}_{M'}(\tilde{y}|x) \) are the confidences that the backdoor expert model \( B \) and the auxiliary model \( M' \), respectively, assign to the prediction \( \tilde{y} \) for input \( x \), and \( \alpha \) is the threshold.

The paper verifies the effectiveness of **BaDExpert** experimentally and shows its robustness and efficiency on multiple datasets (CIFAR10, GTSRB, and ImageNet) and model architectures (ResNet, VGG, MobileNetV2, and Vision Transformer). The results show that **BaDExpert** effectively mitigates 17 state-of-the-art backdoor attacks while having little impact on accuracy on normal inputs.
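The confidence-ratio decision rule above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the softmax-based definition of confidence, the label-shift mislabeling scheme `(y + 1) mod C`, and the threshold value `alpha = 1.0` are all assumptions made for illustration, and the fine-tuning step itself is omitted.

```python
import math
from typing import List, Sequence


def shift_labels(labels: Sequence[int], num_classes: int) -> List[int]:
    """One possible way (an assumption) to deliberately mislabel clean samples:
    map each true label y to (y + 1) mod num_classes."""
    return [(y + 1) % num_classes for y in labels]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def detection_score(logits_m, logits_b, logits_aux, eps: float = 1e-12) -> float:
    """Compute Conf_{M'}(y~|x) / Conf_B(y~|x), where y~ = M(x).

    logits_m   : logits of the original (possibly backdoored) model M
    logits_b   : logits of the backdoor expert model B
    logits_aux : logits of the auxiliary fine-tuned model M'
    Confidence is taken to be the softmax probability (an assumption).
    """
    y_tilde = argmax(logits_m)
    conf_b = softmax(logits_b)[y_tilde]
    conf_aux = softmax(logits_aux)[y_tilde]
    return conf_aux / max(conf_b, eps)  # eps guards against division by zero


def should_reject(logits_m, logits_b, logits_aux, alpha: float) -> bool:
    """Reject the input as a suspected backdoor input if the score is <= alpha."""
    return detection_score(logits_m, logits_b, logits_aux) <= alpha


if __name__ == "__main__":
    alpha = 1.0  # hypothetical threshold, not a value from the paper

    # Clean input: M and M' predict class 2; B has unlearned the normal task.
    clean = ([0.0, 0.0, 5.0], [5.0, 0.0, 0.0], [0.0, 0.0, 5.0])
    # Backdoor input: M and B both predict target class 0; M' does not.
    poisoned = ([5.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 5.0])

    print("reject clean?   ", should_reject(*clean, alpha))
    print("reject poisoned?", should_reject(*poisoned, alpha))
    print("mislabeled:", shift_labels([0, 1, 9], num_classes=10))
```

In this toy setup the score is large when only \( M' \) supports the prediction of \( M \) (clean input) and small when \( B \) supports it but \( M' \) does not (backdoor input), which matches the direction of the inequality in the rule.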

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection
