BAN: Detecting Backdoors Activated by Adversarial Neuron Noise

Xiaoyun Xu, Zhuoran Liu, Stefanos Koffas, Shujian Yu, Stjepan Picek
2024-05-30
Abstract: Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient with 9.99% higher detection success rate than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at https://anonymous.4open.science/r/ban-4B32
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the detection of backdoor attacks on deep learning models. Existing backdoor defenses are mainly based on backdoor inversion: they recover a mask in the feature space to locate prominent backdoor features, thereby separating benign features from backdoor features. These methods have two drawbacks: high computational overhead, and over-reliance on prominent backdoor features that are highly distinguishable from benign features. To overcome them, the paper proposes BAN (Backdoors Activated by Adversarial Neuron Noise), a detection method that improves backdoor feature inversion by incorporating additional neuron activation information. The main contributions are:

1. **Identifying the limited generalization of existing trigger-inversion detection methods**: Feature-space detection relies too heavily on prominent, highly distinguishable backdoor features. The authors analyze trigger-inversion-based backdoor defenses in depth and point out that the prominent backdoor features exploited by current state-of-the-art defenses may not be applicable to identifying backdoors in the input space.
2. **Proposing the BAN method**: BAN adds neuron noise to feature-space trigger inversion, using an adversarial learning process that incorporates neuron activation information into inversion-based backdoor detection. Experimental results show that BAN is 1.37 times more efficient on CIFAR-10 and 5.11 times more efficient on ImageNet200 than the state-of-the-art defense BTI-DBF, with a 9.99% higher detection success rate.
3. **Designing a simple backdoor removal method**: Using neuron noise, the authors further design a simple and effective defense that removes the backdoor, yielding a complete defense framework.

Through these contributions, BAN aims to provide a more efficient and more general backdoor detection and defense scheme, especially across different types of backdoor attacks.
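
The core idea of adversarially increasing the loss with respect to the weights can be illustrated with a toy sketch. The example below is not the authors' implementation. It assumes a two-weight logistic model, multiplicative noise of the form (1 + delta_i) * w_i bounded by a hypothetical budget `eps`, and a single sign-gradient ascent step. All function names, the data, and the parameter values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mean_loss(w, b, data):
    """Mean binary cross-entropy of a logistic model on (x, y) pairs."""
    total = 0.0
    for x, y in data:
        p = sigmoid(logit(w, b, x))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def adversarial_neuron_noise(w, b, data, eps):
    """One sign-gradient ascent step on multiplicative weight noise.

    Assumption for this sketch: perturbed weights are (1 + delta_i) * w_i
    with |delta_i| <= eps, and the noise is chosen to increase the loss.
    """
    grad = [0.0] * len(w)
    for x, y in data:
        err = sigmoid(logit(w, b, x)) - y  # derivative of the loss w.r.t. the logit
        for i, xi in enumerate(x):
            # Chain rule through (1 + delta_i) * w_i, evaluated at delta = 0.
            grad[i] += err * xi * w[i] / len(data)
    # Move each delta_i to the edge of the budget in the loss-increasing direction.
    return [eps * ((g > 0) - (g < 0)) for g in grad]

def apply_noise(w, delta):
    return [(1 + d) * wi for wi, d in zip(w, delta)]

# Hypothetical toy data and model parameters.
data = [([1.0, 0.2], 1), ([0.8, -0.1], 1), ([-1.0, 0.3], 0), ([-0.7, -0.2], 0)]
w, b = [2.0, 0.5], 0.0
eps = 0.3

delta = adversarial_neuron_noise(w, b, data, eps)
clean = mean_loss(w, b, data)
noisy = mean_loss(apply_noise(w, delta), b, data)
print(f"clean loss: {clean:.4f}, loss under adversarial noise: {noisy:.4f}")
```

In this sketch the noise only raises the loss of a single toy model. The feature-space mask and the comparison between backdoored and clean models that the summary describes are not reproduced here.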