Abstract: Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples, which raises concerns about their reliability and poses potential security threats. Adversarial training has been extensively studied as a countermeasure to adversarial attacks. However, because only a limited set of attack types is used during training, the resulting models defend poorly against unknown attacks and lose standard accuracy. Furthermore, we find that adversarially trained models tend to overfit redundant noisy features, which hinders their generalization. To alleviate these issues, this paper proposes attention information bottleneck-guided knowledge distillation (AIB-KD), a method for enhancing models' adversarial robustness. We integrate adversarial training with an attention information bottleneck as the defense framework, aiming for an optimal trade-off between information compression and classification performance. In addition, we employ knowledge distillation to guide the adversarially trained models in learning both the standard attention information and valuable deep feature distributions, which enhances their defense generalization capability. Experimental results demonstrate that AIB-KD can effectively classify adversarial examples in multiple attack settings. For the WideResNet-28-10 model, the average white-box and black-box classification accuracies are 56.59% and 85.49% on the CIFAR-10 dataset and 61.71% and 88.96% on the SVHN dataset, respectively. When applied to unknown attack scenarios, AIB-KD is more effective and interpretable than state-of-the-art methods.
