Abstract: Backdoor attacks pose an increasingly severe security threat to Deep Neural Networks (DNNs) during their development stage. In response, backdoor sample purification has emerged as a promising defense mechanism, aiming to eliminate backdoor triggers while preserving the integrity of the clean content in the samples. However, existing approaches have focused predominantly on the word space; they are ineffective against feature-space triggers and significantly impair performance on clean data. To address this, we introduce a universal backdoor defense that purifies backdoor samples in the activation space by drawing abnormal activations towards optimized minimum clean activation distribution intervals. The advantages of our approach are twofold: (1) by operating in the activation space, our method captures information ranging from surface-level features such as words to higher-level semantic concepts such as syntax, thus counteracting diverse triggers; (2) the fine-grained, continuous nature of the activation space allows for more precise preservation of clean content while removing triggers. Furthermore, we propose a detection module based on statistical information of abnormal activations to achieve a better trade-off between clean accuracy and defense performance.

What problem does this paper attempt to address?

The paper addresses the limitations of existing backdoor sample purification methods when they face feature-space triggers. Specifically:

1. **Limitations of existing methods**:
   - Existing backdoor sample purification methods mainly operate in the word space, purifying samples by removing explicit trigger words. This approach is ineffective against more complex feature-space triggers.
   - Word-space methods perform poorly when the trigger is a high-level semantic concept such as text style or syntactic structure, and they may significantly reduce the model's performance on clean data.
2. **Proposed new method**:
   - The paper introduces a general backdoor defense, called **BadActs**, which purifies backdoor samples in the activation space. By pulling abnormal activations towards the optimized minimum clean activation distribution intervals, the method can handle various types of triggers, including both word-space and feature-space triggers.
   - The activations of individual neurons carry not only surface information (such as words) but also higher-level semantic concepts (such as syntactic structure and part of speech), so operating on them can capture and eliminate backdoor triggers more comprehensively.
3. **Specific contributions**:
   - **Problem analysis**: The paper points out the deficiencies of existing methods against feature-space attacks and attributes their loss of clean-data accuracy to coarse-grained purification strategies.
   - **Introduction of new method**: It proposes purification in the activation space, together with a detection module based on statistical information that improves the trade-off between clean accuracy and defense performance.
   - **Experimental verification**: Extensive experiments show that BadActs performs well on multiple datasets and against different types of attacks, especially those using feature-space triggers.

In summary, the paper aims to overcome the limitations of existing methods in dealing with complex triggers by purifying backdoor samples in the activation space, thereby improving the security and robustness of deep neural networks.
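The idea of pulling abnormal activations back into a clean interval, gated by a statistical detector, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the interval rule (mean ± `k` standard deviations), the anomaly score (fraction of out-of-interval neurons), the `threshold` value, and all function names are assumptions chosen for illustration, whereas the paper optimizes its intervals by a procedure not described in this summary.

```python
from statistics import mean, pstdev


def fit_intervals(clean_acts, k=3.0):
    """Estimate a [low, high] interval per neuron from clean activations.

    clean_acts: list of samples, each a list of per-neuron activations.
    Assumption: mean +/- k * std stands in for the paper's optimized intervals.
    """
    intervals = []
    for column in zip(*clean_acts):  # iterate neuron by neuron
        mu, sigma = mean(column), pstdev(column)
        intervals.append((mu - k * sigma, mu + k * sigma))
    return intervals


def anomaly_score(acts, intervals):
    """Fraction of neurons whose activation lies outside its clean interval."""
    outside = sum(1 for a, (lo, hi) in zip(acts, intervals) if a < lo or a > hi)
    return outside / len(acts)


def purify(acts, intervals):
    """Pull each activation to the nearest value inside its clean interval."""
    return [min(max(a, lo), hi) for a, (lo, hi) in zip(acts, intervals)]


def defend(acts, intervals, threshold=0.25):
    """Purify only samples the detector flags; leave the rest untouched."""
    if anomaly_score(acts, intervals) > threshold:
        return purify(acts, intervals)
    return list(acts)


# Toy data: three clean samples with three neurons each (hypothetical values).
clean = [[0.0, 1.0, 2.0], [0.2, 1.2, 1.8], [-0.2, 0.8, 2.2]]
intervals = fit_intervals(clean, k=3.0)

suspicious = [9.0, 1.0, 2.0]  # first neuron is far outside its clean range
print(anomaly_score(suspicious, intervals))
print(defend(suspicious, intervals))
```

In this toy setup only the first neuron of `suspicious` is modified, while in-range activations pass through unchanged, which mirrors the fine-grained preservation of clean content described above. Gating purification on the detector score is what keeps clean samples from being altered at all.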

BadActs: A Universal Backdoor Defense in the Activation Space
