Abstract: Most existing methods to detect backdoored machine learning (ML) models take one of two approaches: trigger inversion (a.k.a. reverse engineering) and weight analysis (a.k.a. model diagnosis). In particular, gradient-based trigger inversion is considered to be among the most effective backdoor detection techniques, as evidenced by the TrojAI competition, Trojan Detection Challenge and BackdoorBench. However, little has been done to understand why this technique works so well and, more importantly, whether it raises the bar to the backdoor attack. In this paper, we report the first attempt to answer this question by analyzing the change rate of the backdoored model around its trigger-carrying inputs. Our study shows that existing attacks tend to inject the backdoor characterized by a low change rate around trigger-carrying inputs, which are easy to capture by gradient-based trigger inversion. In the meantime, we found that the low change rate is not necessary for a backdoor attack to succeed: we design a new attack enhancement called \textit{Gradient Shaping} (GRASP), which follows the opposite direction of adversarial training to reduce the change rate of a backdoored model with regard to the trigger, without undermining its backdoor effect. Also, we provide a theoretical analysis to explain the effectiveness of this new technique and the fundamental weakness of gradient-based trigger inversion. Finally, we perform both theoretical and experimental analysis, showing that the GRASP enhancement does not reduce the effectiveness of the stealthy attacks against the backdoor detection methods based on weight analysis, as well as other backdoor mitigation methods without using detection.

What problem does this paper attempt to address?

The paper examines gradient-based trigger inversion, a leading technique for detecting backdoored machine learning (ML) models: why it is effective, and where it falls short. It then proposes an attack enhancement, Gradient Shaping (GRASP), that makes backdoor attacks stealthier and more resistant to existing detection techniques. The core contributions of the paper can be summarized as follows:

1. **In-depth Analysis of Trigger Inversion Techniques**: The paper presents what it describes as the first detailed analysis of why gradient-based trigger inversion is so effective at detecting backdoored models. It also reveals a vulnerability of current trigger inversion techniques, which certain types of backdoor attacks can circumvent.
2. **New Backdoor Injection Technique**: A new backdoor injection technique, Gradient Shaping (GRASP), is proposed. It exploits the fundamental limitations of gradient-based optimization to enhance existing backdoor attacks under practical threat models, making them harder to detect with trigger inversion without reducing their ability to evade other detection methods such as weight analysis.

Key findings of the paper include:

- **Relationship Between Trigger Effective Radius and Detection Effectiveness**: The study finds a significant correlation between the trigger effective radius of a backdoored model (i.e., how far the trigger can be changed while remaining active) and the effectiveness of trigger inversion. Backdoors with a smaller trigger effective radius evade detection more easily.
- **Design Principles of GRASP**: GRASP uses data poisoning to add training samples that carry a noisy version of the trigger. These samples are labeled as either the target class or the source class, which reduces the trigger's effective radius and makes it difficult for gradient-based trigger inversion to succeed.
- **Theoretical and Experimental Validation**: The paper provides theoretical analysis explaining how GRASP weakens trigger inversion by reducing the trigger's effective radius. Experiments show that GRASP significantly improves the stealthiness of backdoor attacks, making it difficult for representative trigger inversion defenses such as Neural Cleanse and TABOR to recover backdoor triggers.

In summary, the paper analyzes the strengths and limitations of existing backdoor detection techniques and proposes a method that makes backdoor attacks stealthier. Both results help clarify the current state of backdoor detection and where it is heading.
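
The data-poisoning idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes images scaled to [0, 1], a fixed patch trigger applied through a binary mask, and one particular reading of the labeling rule: samples with the exact trigger are relabeled to the target class, while copies whose trigger region is perturbed with Gaussian noise keep their source label. The function names (`stamp_trigger`, `grasp_style_poison`) and the parameters `noise_scale` and `poison_frac` are hypothetical.

```python
import numpy as np


def stamp_trigger(x, trigger, mask):
    """Paste `trigger` onto image(s) `x` wherever the binary `mask` is 1."""
    return x * (1.0 - mask) + trigger * mask


def grasp_style_poison(images, labels, trigger, mask, target_label,
                       noise_scale=0.1, poison_frac=0.1, seed=0):
    """Return the dataset extended with backdoor and noisy-trigger samples.

    Hypothetical sketch: `noise_scale` and `poison_frac` are illustrative
    knobs, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(images)
    k = max(1, int(poison_frac * n))
    idx = rng.choice(n, size=k, replace=False)

    # Ordinary backdoor samples: exact trigger, relabeled to the target class.
    stamped = stamp_trigger(images[idx], trigger, mask)
    backdoor_y = np.full(k, target_label, dtype=labels.dtype)

    # Noisy-trigger samples: perturb only the trigger region and keep the
    # original (source) label, so a slightly altered trigger stops working.
    noise = rng.normal(0.0, 1.0, size=stamped.shape) * mask
    noisy_x = np.clip(stamped + noise_scale * noise, 0.0, 1.0)
    noisy_y = labels[idx].copy()

    x_out = np.concatenate([images, stamped, noisy_x])
    y_out = np.concatenate([labels, backdoor_y, noisy_y])
    return x_out, y_out
```

Training on such a dataset pushes the model to activate the backdoor only for inputs very close to the exact trigger. That corresponds to the smaller effective radius the summary associates with harder trigger inversion. In this sketch `noise_scale` controls how tightly that region is drawn.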

Gradient Shaping: Enhancing Backdoor Attack Against Reverse Engineering
