Universal Backdoor Attacks

Benjamin Schneider, Nils Lukas, Florian Kerschbaum
2024-01-20
Abstract: Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and re-used many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show this is not necessarily true and more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset. Our source code is available at https://github.com/Ben-Schneider-code/Universal-Backdoor-Attacks.
Machine Learning, Cryptography and Security, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the vulnerability of large-scale image classifiers that are trained on untrusted datasets, in particular the threat of backdoor attacks. The researchers study how an attacker can implant a backdoor into a deep image classifier at training time through data poisoning, so that inputs carrying a trigger are misclassified at inference time. Existing backdoor attacks usually target a single, pre-set class. The paper proposes Universal Backdoor Attacks, which can target many classes at once while manipulating only a very small fraction of the data (for example, poisoning only 0.15% of the training set) to gain control over a large number of classes.

### Main contributions of the paper

1. **Demonstrate the practical threat of universal backdoor attacks on deep image classifiers**: This type of attack lets an attacker control thousands of classes.
2. **Introduce a technique for creating universal poisons**: By exploiting inter-class poison transferability, trigger features can be reused to target new classes.
3. **Show that universal backdoor attacks are robust against multiple defenses**: Even against current state-of-the-art defenses, universal backdoor attacks maintain a high attack success rate.

### Method overview

- **Threat model**: The attacker is assumed to be able to manipulate part of a dataset scraped from the web in order to implant a backdoor in the victim's model. The attacker's goal is a universal backdoor that can target any class in the victim's model while poisoning as little training data as possible.
- **Inter-class poison transferability**: The study finds that raising the average attack success rate on one set of classes also raises the attack success rate on another, entirely unrelated set of classes. In other words, once the model has learned the poison for some classes, it becomes more vulnerable to the poison for other classes.
- **Trigger generation**: The latent space is compressed with Linear Discriminant Analysis (LDA), and a binary-coded trigger is generated for each class from that class's latent features. These triggers are embedded in images to attack the target classes.

### Experimental results

- **Effectiveness on the ImageNet-1K dataset**: Experiments were carried out with patch triggers and blend triggers. Even when only 0.16% of the dataset is poisoned, the patch trigger achieves an attack success rate above 80%.
- **Extension to larger datasets**: Experiments were carried out on the ImageNet-2K, ImageNet-4K and ImageNet-6K datasets. Universal backdoor attacks remain effective at these scales, with an attack success rate of over 90% on ImageNet-4K.
- **Robustness against defenses**: Four state-of-the-art defenses (fine-tuning, fine-pruning, Neural Attention Distillation and Neural Cleanse) were evaluated against universal backdoor attacks. The attacks remain highly robust against all of them.

### Conclusion

This paper demonstrates the practical threat of universal backdoor attacks on deep image classifiers and proposes effective methods for creating and using them. The findings matter to deep-learning practitioners who must weigh security when training and deploying image classifiers.
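
To make the trigger-generation step from the method overview more concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes the per-class latent means have already been compressed, for example by LDA, and it assumes one plausible encoding rule: bit i of a class's code is 1 when that class's mean in dimension i exceeds the average over all classes. The function names `binary_codes` and `stamp_patch`, the example latent values, and the black/white square layout of the patch are all hypothetical choices made for illustration.

```python
from statistics import mean


def binary_codes(class_means):
    """Map each class's compressed latent mean to a list of bits.

    Assumption (not confirmed by the summary): bit i is 1 when the class
    mean exceeds the average of dimension i taken over all classes.
    """
    n_dims = len(next(iter(class_means.values())))
    centroid = [mean(v[i] for v in class_means.values()) for i in range(n_dims)]
    return {
        label: [1 if v[i] > centroid[i] else 0 for i in range(n_dims)]
        for label, v in class_means.items()
    }


def stamp_patch(image, code, cell=2, on=255, off=0):
    """Return a copy of a grayscale image (a list of rows) with the code
    drawn as a row of cell x cell squares in the top-left corner."""
    if cell > len(image) or cell * len(code) > len(image[0]):
        raise ValueError("patch does not fit in the image")
    poisoned = [row[:] for row in image]  # leave the clean image untouched
    for bit_index, bit in enumerate(code):
        value = on if bit else off
        for r in range(cell):
            for c in range(cell):
                poisoned[r][bit_index * cell + c] = value
    return poisoned


# Hypothetical 3-dimensional latent means for three classes.
class_means = {
    "cat": [0.9, -0.2, 0.3],
    "dog": [0.7, 0.4, -0.6],
    "car": [-1.0, -0.5, 0.9],
}
codes = binary_codes(class_means)

# Poison one clean 4x8 mid-gray image: stamp the "car" trigger and
# relabel the sample with the target class, as a poisoning attack would.
clean = [[128] * 8 for _ in range(4)]
poisoned = stamp_patch(clean, codes["car"])
poison_sample = (poisoned, "car")
```

Because classes that are close in the latent space share many bits under this encoding, their triggers overlap. That gives one intuition for why learning the trigger for some classes could help the model learn the triggers for others, which is the transferability effect the summary describes.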