Abstract: Backdoor attacks covertly implant triggers into deep neural networks (DNNs) by poisoning a small portion of the training data with pre-designed backdoor triggers. This vulnerability is exacerbated in the era of large models, where extensive (pre-)training on web-crawled datasets is susceptible to compromise. In this paper, we introduce a novel two-step defense framework named Expose Before You Defend (EBYD). EBYD unifies existing backdoor defense methods into a comprehensive defense system with enhanced performance. Specifically, EBYD first exposes the backdoor functionality in the backdoored model through a model preprocessing step called backdoor exposure, and then applies detection and removal methods to the exposed model to identify and eliminate the backdoor features. In the first step of backdoor exposure, we propose a novel technique called Clean Unlearning (CUL), which proactively unlearns clean features from the backdoored model to reveal the hidden backdoor features. We also explore various model editing/modification techniques for backdoor exposure, including fine-tuning, model sparsification, and weight perturbation. Using EBYD, we conduct extensive experiments on 10 image attacks and 6 text attacks across 2 vision datasets (CIFAR-10 and an ImageNet subset) and 4 language datasets (SST-2, IMDB, Twitter, and AG's News). The results demonstrate the importance of backdoor exposure for backdoor defense, showing that the exposed models can significantly benefit a range of downstream defense tasks, including backdoor label detection, backdoor trigger recovery, backdoor model detection, and backdoor removal. We hope our work could inspire more research in developing advanced defense frameworks with exposed models. Our code is available at: https://github.com/bboylyg/Expose-Before-You-Defend
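The poisoning step mentioned in the abstract can be made concrete with a small toy sketch. The code below is an illustration only: the patch-style trigger, the 5% poisoning rate, the target label, and the helper name `poison_dataset` are assumptions chosen for the example, not the attacks evaluated in the paper.

```python
import random

def poison_dataset(images, labels, target_label, rate, seed=0):
    """Stamp a 2x2 patch on a random `rate` fraction of images and relabel them.

    Hypothetical helper: a patch trigger in the bottom-right corner is just one
    simple, well-known style of backdoor trigger.
    """
    rng = random.Random(seed)
    n_poison = int(len(images) * rate)
    chosen = set(rng.sample(range(len(images)), n_poison))
    out_images, out_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        img = [row[:] for row in img]  # copy so the clean data stays untouched
        if i in chosen:
            for r in (-2, -1):
                for c in (-2, -1):
                    img[r][c] = 1.0    # trigger: bright patch in the corner
            lab = target_label         # relabel to the attacker's target class
        out_images.append(img)
        out_labels.append(lab)
    return out_images, out_labels, chosen

# Toy data: 100 all-black 4x4 "images" with labels 0..9.
images = [[[0.0] * 4 for _ in range(4)] for _ in range(100)]
labels = [i % 10 for i in range(100)]
p_images, p_labels, chosen = poison_dataset(images, labels, target_label=7, rate=0.05)
print(f"poisoned {len(chosen)} of {len(images)} samples")
```

A model trained on `p_images` and `p_labels` would see the patch consistently paired with label 7, which is the association a backdoor defense has to find and remove.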

What problem does this paper attempt to address?

The problem this paper addresses is backdoor attacks on deep neural networks (DNNs). A backdoor attack poisons a small portion of the training data with preset triggers (backdoor triggers), so that the trained model makes incorrect predictions on inputs that contain the trigger at test time. This type of attack is particularly severe in the era of large-scale models, which are often extensively pre-trained on datasets crawled from the web that are susceptible to tampering. The paper proposes a novel two-step defense framework called "Expose Before You Defend" (EBYD), aimed at unifying existing backdoor defense methods and improving their performance. Specifically, the EBYD framework first reveals the backdoor functionality in the backdoored model through a preprocessing step called "backdoor exposure," and then applies detection and removal methods to the exposed model to identify and eliminate the backdoor features.

### Main Contributions of the Paper:

1. **Introduction of the EBYD Framework**: This framework divides backdoor defense into two steps. The first step is backdoor exposure, which isolates backdoor functionality through specialized model preprocessing/editing techniques; the second step is backdoor defense, which applies existing detection and removal techniques to the exposed model to enhance overall performance.
2. **Proposed Clean Unlearning (CUL) Technique**: This is a new backdoor exposure technique that reveals hidden backdoor functionality by actively unlearning clean features from the backdoored model. CUL is effective even when unlearning is performed on a small number of clean samples.
3. **Exploration of Various Model Preprocessing Techniques**: Fine-tuning, model sparsification, and weight perturbation are also examined, all of which can be used to expose backdoor functionality in backdoored models.
4. **Comprehensive Experimental Evaluation**: Extensive experiments were conducted on 2 vision datasets and 4 language datasets, covering 10 image attacks and 6 text attacks. The reported results show that the EBYD framework significantly outperforms existing state-of-the-art methods in detecting and removing backdoors.

### Key Innovations:

- **Importance of Backdoor Exposure**: The paper emphasizes the importance of backdoor exposure for backdoor defense. Exposing backdoor functionality can significantly improve downstream defense tasks such as backdoor label detection, backdoor trigger recovery, backdoor model detection, and backdoor removal.
- **Unified Defense Framework**: The EBYD framework integrates existing backdoor defense methods into a comprehensive defense system.
- **Cross-Domain Applicability**: EBYD applies to backdoor defense in the image domain and also extends to language models, where it defends against various textual backdoor attacks.

Overall, by introducing the EBYD framework, this paper provides a systematic approach to improving the security and robustness of deep neural networks against backdoor attacks.
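The idea of exposing a backdoor by unlearning clean features can be sketched on a toy model. The summary does not give the CUL objective, so the code below makes explicit assumptions: unlearning is implemented as gradient ascent on the cross-entropy loss over a few clean samples, it stops once clean accuracy reaches a chosen threshold, and the "backdoored" linear classifier has hand-set weights. All names (`clean_unlearn`, `stop_acc`, the two-feature inputs) are hypothetical and are not taken from the paper's code.

```python
import math

def scores(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(s):
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    s = scores(W, x)
    return s.index(max(s))

def accuracy(W, xs, ys):
    return sum(predict(W, x) == y for x, y in zip(xs, ys)) / len(xs)

def clean_unlearn(W, xs, ys, lr=0.5, stop_acc=0.0, max_steps=500):
    """Assumed unlearning rule: gradient ASCENT on clean cross-entropy loss."""
    W = [row[:] for row in W]  # work on a copy of the weights
    for _ in range(max_steps):
        if accuracy(W, xs, ys) <= stop_acc:  # assumed stopping rule
            break
        grad = [[0.0] * len(row) for row in W]
        for x, y in zip(xs, ys):
            p = softmax(scores(W, x))
            for k in range(len(W)):
                err = p[k] - (1.0 if k == y else 0.0)
                for j, xj in enumerate(x):
                    grad[k][j] += err * xj / len(xs)
        for k in range(len(W)):
            for j in range(len(W[k])):
                W[k][j] += lr * grad[k][j]  # plus sign: increase the clean loss
    return W

# Inputs are [clean_feature, trigger_feature]; rows of W are class 0 and class 1.
# Hand-set toy weights: the trigger feature pushes strongly toward class 0.
backdoored = [[-2.0, 5.0], [2.0, 0.0]]
clean_x, clean_y = [[1.0, 0.0], [-1.0, 0.0]], [1, 0]
trig_x, trig_y = [[1.0, 1.0], [-1.0, 1.0]], [0, 0]  # backdoor target label is 0

exposed = clean_unlearn(backdoored, clean_x, clean_y)
print("clean accuracy:", accuracy(backdoored, clean_x, clean_y), "->", accuracy(exposed, clean_x, clean_y))
print("trigger success:", accuracy(backdoored, trig_x, trig_y), "->", accuracy(exposed, trig_x, trig_y))
```

In this toy, the trigger feature is zero on every clean sample, so ascent on the clean loss never touches the trigger weights: clean behavior is destroyed while the trigger still forces the target label. That separation holds here by construction. Whether and how strongly it holds for real networks is what the paper's experiments evaluate.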

Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models
