BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal
2023-10-05
Abstract: We present a novel defense against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor functionality of a given backdoored model to a backdoor expert model. The approach is straightforward -- finetuning the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 17 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2 and Vision Transformer).
Cryptography and Security, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper proposes a new defense against backdoor attacks on deep neural networks (DNNs). It targets the post-development setting: detecting and filtering backdoor inputs after the model has been built, without needing to know how the model was generated. Most existing defense strategies focus on the development stage, taking measures during training to prevent a backdoor from being implanted. These methods are hard to apply once a model has been deployed, because they require access to the training dataset or to training-time information such as gradient updates or loss values.

The proposed solution, **BaDExpert** (Backdoor Input Detection with Backdoor Expert), extracts the backdoor functionality from a given backdoored model and uses it to identify and filter out backdoor inputs. The steps are as follows:

1. **Backdoor functionality extraction**:
   - Fine-tune the backdoored model on a small set of deliberately mislabeled clean samples, so that it forgets its normal classification functionality but retains the backdoor functionality. The resulting model is called the "backdoor expert model".
   - Expressed as a formula:
     \[ \text{Backdoor Expert Model} \quad B \leftarrow \text{finetune}(M, D_c, \eta, m) \]
     where \( M \) is the backdoored model, \( D_c \) is the small set of deliberately mislabeled clean samples, \( \eta \) is the learning rate, and \( m \) is the number of iterations.
2. **Backdoor input detection**:
   - Use the backdoor expert model \( B \) together with an auxiliary model \( M' \), obtained by standard fine-tuning, to decide whether an input is a backdoor input.
   - The basic detection rule checks whether \( B \) agrees with the original model \( M \). If the predictions of \( B \) and \( M \) are consistent, the input is considered a backdoor input; if they are inconsistent, the input is considered a normal input.
   - The full decision rule, which also uses \( M' \), can be expressed as:
     \[ \text{Reject input } x \quad \text{if} \quad \frac{\text{Conf}_{M'}(\tilde{y}|x)}{\text{Conf}_B(\tilde{y}|x)} \leq \alpha \]
     where \( \tilde{y} = M(x) \), \( \text{Conf}_B(\tilde{y}|x) \) and \( \text{Conf}_{M'}(\tilde{y}|x) \) are the confidences that the backdoor expert model \( B \) and the auxiliary model \( M' \) assign to \( \tilde{y} \) for input \( x \), and \( \alpha \) is the threshold. An input is therefore rejected when \( M' \) assigns \( \tilde{y} \) a low confidence relative to \( B \).

The paper validates **BaDExpert** experimentally, demonstrating its robustness and efficiency on multiple datasets (CIFAR10, GTSRB, and ImageNet) and model architectures (ResNet, VGG, MobileNetV2, and Vision Transformer). The results show that **BaDExpert** effectively mitigates 17 state-of-the-art backdoor attacks while having little impact on accuracy on normal inputs.
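
The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The helper names (`softmax`, `mislabel`, `should_reject`) are hypothetical, the label-shifting scheme in `mislabel` is only one assumed way to produce "deliberately mislabeled" samples, and confidence is assumed to be the softmax probability of the class \( \tilde{y} \). The fine-tuning loop itself is omitted; the sketch only shows how mislabeled targets could be built and how the rejection rule is evaluated from the logits of the three models.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mislabel(labels, num_classes):
    """Assumed mislabeling scheme: shift each clean label to the next class.

    Any scheme that guarantees a wrong label would serve the same purpose.
    """
    return [(y + 1) % num_classes for y in labels]


def should_reject(logits_m, logits_b, logits_aux, alpha, eps=1e-12):
    """Evaluate: reject x if Conf_M'(y~|x) / Conf_B(y~|x) <= alpha.

    logits_m   -- logits of the original (backdoored) model M on x
    logits_b   -- logits of the backdoor expert model B on x
    logits_aux -- logits of the auxiliary fine-tuned model M' on x
    eps        -- small constant (an assumption) to avoid division by zero
    """
    # y~ = M(x): the class predicted by the original model
    y_tilde = max(range(len(logits_m)), key=logits_m.__getitem__)
    conf_b = softmax(logits_b)[y_tilde]
    conf_aux = softmax(logits_aux)[y_tilde]
    return conf_aux / (conf_b + eps) <= alpha
```

For example, when `logits_m` and `logits_b` both strongly favor the same class while `logits_aux` favors a different one, the ratio is small and `should_reject` returns `True`. When the expert disagrees with `M` and the auxiliary model agrees with it, the ratio is large and the input is accepted.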