FLARE: Towards Universal Dataset Purification against Backdoor Attacks

Linshan Hou, Wei Luo, Zhongyun Hua, Songhua Chen, Leo Yu Zhang, Yiming Li
2024-11-29
Abstract: Deep neural networks (DNNs) are susceptible to backdoor attacks, where adversaries poison datasets with adversary-specified triggers to implant hidden backdoors, enabling malicious manipulation of model predictions. Dataset purification serves as a proactive defense by removing malicious training samples to prevent backdoor injection at its source. We first reveal that the current advanced purification methods rely on a latent assumption that the backdoor connections between triggers and target labels in backdoor attacks are simpler to learn than the benign features. We demonstrate that this assumption, however, does not always hold, especially in all-to-all (A2A) and untargeted (UT) attacks. As a result, purification methods that analyze the separation between the poisoned and benign samples in the input-output space or the final hidden layer space are less effective. We observe that this separability is not confined to a single layer but varies across different hidden layers. Motivated by this understanding, we propose FLARE, a universal purification method to counter various backdoor attacks. FLARE aggregates abnormal activations from all hidden layers to construct representations for clustering. To enhance separation, FLARE develops an adaptive subspace selection algorithm to isolate the optimal space for dividing an entire dataset into two clusters. FLARE assesses the stability of each cluster and identifies the cluster with higher stability as poisoned. Extensive evaluations on benchmark datasets demonstrate the effectiveness of FLARE against 22 representative backdoor attacks, including all-to-one (A2O), all-to-all (A2A), and untargeted (UT) attacks, and its robustness to adaptive attacks.
Cryptography and Security, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the vulnerability of deep neural networks (DNNs) to backdoor attacks. Specifically, it studies how dataset purification can defend against multiple types of backdoor attacks.

#### Background of backdoor attacks

In a backdoor attack, an adversary injects samples carrying a specific trigger into the training dataset and reassigns their labels, which implants a hidden backdoor in the trained model. At inference time, the adversary can manipulate the model's predictions by adding the same trigger to an input. Because the backdoored model behaves like an unattacked model on benign samples, the attack is difficult to detect.

#### Limitations of existing methods

Existing dataset purification methods rely on an implicit assumption: the connection between the trigger and the target label is easier to learn than benign features. This assumption does not always hold, especially in all-to-all (A2A) and untargeted (UT) attacks. As a result, methods that look for a separation between poisoned and benign samples in the input-output space or the final hidden-layer space perform poorly. These methods also usually examine the feature representation of one specific layer and ignore the differences between hidden layers.

#### Core problem of the paper

The paper therefore asks: **How can a general-purpose dataset purification method be designed to defend effectively against various types of backdoor attacks?**

### Solution

To answer this question, the paper proposes FLARE (Full-spectrum Learning Analysis for Removing Embedded poisoned samples), a dataset purification method based on full-spectrum learning analysis. The main contributions of FLARE are as follows:

1. **Reveal the limitations of existing methods**: The paper points out that existing advanced dataset purification methods rely on an implicit assumption that backdoor connections are easier to learn than benign features. This assumption does not always hold in A2A and UT attacks, and the separability of poisoned and benign samples is not consistent at any single layer; it varies across hidden layers.
2. **Propose a general-purpose purification method**: FLARE constructs a representation for each sample by aggregating abnormal activations from all hidden layers. It then uses an adaptive subspace selection algorithm to divide the entire dataset into two clusters, evaluates the stability of each cluster, and identifies the cluster with higher stability as the poisoned samples. The method therefore considers the hidden-layer features of the entire model, not only the input-output relationship.
3. **Experimental verification**: The paper conducts extensive experiments on benchmark datasets to verify the effectiveness of FLARE against 22 representative backdoor attacks, including all-to-one (A2O), A2A, and UT attacks, and shows its robustness against potential adaptive attacks.

Together, these contributions allow FLARE to defend against a wider range of attack scenarios, reducing the chance that a backdoor is injected at its source, the training data.
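The aggregate, cluster, and score pipeline described in contribution 2 can be illustrated with a minimal sketch on synthetic data. This is not the authors' implementation. The robust z-score used as an "abnormality" measure, plain 2-means clustering, and cluster compactness as a proxy for "stability" are all stand-ins chosen for brevity, and the adaptive subspace selection step is omitted. The synthetic activations are constructed (as an assumption) so that poisoned samples form a tight, shifted group in two of three layers, and all function names are hypothetical.

```python
import numpy as np

def aggregate_abnormality(layer_acts):
    """Robust z-score each layer's activations, then concatenate across layers.

    Stand-in for "aggregating abnormal activations from all hidden layers".
    layer_acts: list of arrays, each of shape (n_samples, n_units).
    """
    feats = []
    for acts in layer_acts:
        med = np.median(acts, axis=0, keepdims=True)
        mad = np.median(np.abs(acts - med), axis=0, keepdims=True) + 1e-8
        feats.append((acts - med) / (1.4826 * mad))
    return np.concatenate(feats, axis=1)

def two_means(x, n_iter=50):
    """Plain 2-means with a deterministic farthest-point initialisation."""
    c1 = x[np.argmax(np.linalg.norm(x - x.mean(axis=0), axis=1))]
    c0 = x[np.argmax(np.linalg.norm(x - c1, axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.stack([x[labels == k].mean(axis=0) for k in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def compactness(x, labels, centers):
    """Proxy "stability" score (an assumption): higher for a tighter cluster."""
    return [
        1.0 / (np.linalg.norm(x[labels == k] - centers[k], axis=1).mean() + 1e-8)
        for k in (0, 1)
    ]

# Synthetic stand-in for per-layer activations of a trained model.
rng = np.random.default_rng(0)
n, n_poison, width = 300, 30, 8
is_poisoned = np.zeros(n, dtype=bool)
is_poisoned[:n_poison] = True
layers = [rng.normal(size=(n, width)) for _ in range(3)]
for acts in layers[1:]:
    # Assumption: poisoned samples share a tight, shifted activation pattern.
    acts[is_poisoned] = 6.0 + 0.2 * rng.normal(size=(n_poison, width))

feats = aggregate_abnormality(layers)
labels, centers = two_means(feats)
scores = compactness(feats, labels, centers)
flagged = labels == int(np.argmax(scores))  # more "stable" cluster is flagged

recall = (flagged & is_poisoned).sum() / n_poison
precision = (flagged & is_poisoned).sum() / max(flagged.sum(), 1)
print(f"flagged={flagged.sum()} recall={recall:.2f} precision={precision:.2f}")
```

Note that the poisoned samples in this toy data do not differ from benign ones in the first layer at all, so a detector that inspected only that layer would miss them; concatenating features from every layer is what makes the two groups separable here.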