Abstract: Backdoor attacks pose a significant security threat to deep neural networks (DNNs): a backdoored model behaves normally on clean inputs but produces manipulated predictions when a specific trigger pattern is present. In this paper, we consider a practical post-training backdoor defense scenario, in which the defender aims to determine whether a trained model has been compromised by a backdoor attack. Existing post-training backdoor detection approaches often assume that the defender knows the attack details, has access to the model's logit outputs, or knows the model parameters, which limits their use in practice. In contrast, our approach functions as a lightweight diagnostic scanning tool that offers interpretability and visualization. Querying the model only for hard labels, we construct decision boundaries within the convex combination of three samples. We observe two intriguing phenomena in backdoored models: a noticeable shrinking of the areas dominated by clean samples and a significant increase in the surrounding areas dominated by the target label. Leveraging this observation, we propose Model X-ray, a novel backdoor detection approach based on the analysis of illustrated two-dimensional (2D) decision boundaries. Our approach includes two strategies, one focused on the decision areas dominated by clean samples and the other on the concentration of the label distribution. It can not only identify whether the target model is infected but also determine the attacked target label under the all-to-one attack strategy. Importantly, it accomplishes this solely from the predicted hard labels of clean inputs, without any assumptions about the attack or prior knowledge of the model's training details. Extensive experiments demonstrate that Model X-ray is effective and efficient across diverse backdoor attacks, datasets, and architectures. Ablation studies on hyperparameters, additional attack strategies, and further discussion are also provided.
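The scanning idea in the abstract (query hard labels over the convex combinations of three clean samples, then measure how much of that region the clean samples' labels still dominate and how concentrated the label distribution is) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the barycentric grid resolution, the use of Shannon entropy as the concentration measure, and the toy `clean_model` / `backdoored_model` predictors are all assumptions made for illustration.

```python
import numpy as np

def barycentric_grid(steps):
    """Weights (a, b, c) with a + b + c = 1 and a, b, c >= 0 on a regular grid."""
    pts = [(i / steps, j / steps, (steps - i - j) / steps)
           for i in range(steps + 1) for j in range(steps + 1 - i)]
    return np.array(pts)

def scan_triple(predict, x0, x1, x2, steps=20):
    """Scan the convex combinations of three samples using hard labels only.

    `predict` is a hypothetical black-box function mapping a batch of inputs
    to integer class labels. Returns (clean_ratio, entropy, dominant_label).
    """
    weights = barycentric_grid(steps)                 # shape (n, 3)
    corners = np.stack([x0, x1, x2])                  # shape (3, ...)
    mixes = np.tensordot(weights, corners, axes=1)    # shape (n, ...)
    labels = np.asarray(predict(mixes))
    corner_labels = np.asarray(predict(corners))
    # Fraction of the region labelled like one of the three clean samples.
    clean_ratio = float(np.isin(labels, corner_labels).mean())
    # Shannon entropy of the label histogram (assumed concentration measure).
    counts = np.bincount(labels)
    p = counts[counts > 0] / labels.size
    entropy = float(-(p * np.log(p)).sum())
    dominant = int(np.argmax(counts))
    return clean_ratio, entropy, dominant

# Toy stand-ins for real models: three class centres in a 3-D input space.
CENTERS = np.eye(3)

def _sq_dists(batch):
    return ((batch[:, None, :] - CENTERS[None]) ** 2).sum(-1)

def clean_model(batch):
    """Nearest-centre classifier with classes 0, 1, 2."""
    return _sq_dists(batch).argmin(1)

def backdoored_model(batch):
    """Same classifier, but label 3 takes over everywhere far from a centre."""
    d = _sq_dists(batch)
    return np.where(d.min(1) < 0.008, d.argmin(1), 3)

clean_stats = scan_triple(clean_model, *CENTERS)
backdoor_stats = scan_triple(backdoored_model, *CENTERS)
```

In this toy setting the clean predictor keeps the whole triangle under the three samples' own labels, while the backdoored stand-in lets a single extra label take over most of it, which lowers both the clean-dominated ratio and the entropy and exposes the dominant label as a candidate target. A real scan would average these statistics over many sample triples and choose a decision threshold, neither of which is specified here.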

Model X-ray: Detecting Backdoored Models via Decision Boundary

B3: Backdoor Attacks Against Black-box Machine Learning Models

Black-box Detection of Backdoor Attacks with Limited Information and Data

Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks

Reverse Backdoor Distillation: Towards Online Backdoor Attack Detection for Deep Neural Network Models

Data-Free Backdoor Model Inspection: Masking and Reverse Engineering Loops for Feature Counting

Backdoor Defense via Decoupling the Training Process

Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models

NTD: Non-Transferability Enabled Deep Learning Backdoor Detection

BAN: Detecting Backdoors Activated by Adversarial Neuron Noise

Untargeted Backdoor Attack Against Object Detection

Confidence Matters: Inspecting Backdoors in Deep Neural Networks via Distribution Transfer

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Stand-in Backdoor: A Stealthy and Powerful Backdoor Attack

An Effective and Resilient Backdoor Attack Framework against Deep Neural Networks and Vision Transformers

Need for Speed: Taming Backdoor Attacks with Speed and Precision

Backdoor Defense Via Deconfounded Representation Learning

Rethinking Backdoor Attacks

Rethinking Backdoor Detection Evaluation for Language Models