Abstract: The remarkable advances in deep learning have led to the emergence of many off-the-shelf classifiers, e.g., large pre-trained models. However, since they are typically trained on clean data, they remain vulnerable to adversarial attacks. Despite this vulnerability, their superior performance and transferability keep off-the-shelf classifiers valuable in practice, which calls for methods that provide adversarial robustness for them in a post-hoc manner. A recently proposed method, denoised smoothing, places a denoiser model in front of the classifier to obtain provable robustness without additional training. However, the denoiser often hallucinates, i.e., it produces images that have lost the semantics of their originally assigned class, leading to a drop in robustness. Furthermore, its noise-and-denoise procedure introduces a significant distribution shift from the original distribution, causing the denoised smoothing framework to achieve sub-optimal robustness. In this paper, we introduce Fine-Tuning with Confidence-Aware Denoised Image Selection (FT-CADIS), a novel fine-tuning scheme to enhance the certified robustness of off-the-shelf classifiers. FT-CADIS is inspired by the observation that the confidence of off-the-shelf classifiers can effectively identify hallucinated images during denoised smoothing. Based on this, we develop a confidence-aware training objective to handle such hallucinated images and improve the stability of fine-tuning from denoised images. In this way, the classifier can be fine-tuned using only images that are beneficial for adversarial robustness. We also find that such fine-tuning can be done by updating only a small fraction of the classifier's parameters. Extensive experiments demonstrate that FT-CADIS achieves state-of-the-art certified robustness among denoised smoothing methods across all $\ell_2$-adversary radii on various benchmarks.
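
As a rough illustration of the noise-and-denoise pipeline and of confidence-based selection described in the abstract, the following NumPy sketch may help. It is not the FT-CADIS implementation: the names `smoothed_predict` and `select_confident`, the fixed probability `threshold`, and the toy `denoise` and `classify` stand-ins are all assumptions made for this example, and the actual selection rule and training objective are defined in the paper, not here.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def noise_and_denoise(x, denoise, sigma, n, rng):
    # Draw n Gaussian-perturbed copies of x, then pass them through the denoiser.
    noisy = x + sigma * rng.standard_normal((n,) + x.shape)
    return denoise(noisy)

def smoothed_predict(x, denoise, classify, sigma, n, rng):
    # Majority vote of the classifier over denoised noisy copies of x.
    logits = classify(noise_and_denoise(x, denoise, sigma, n, rng))
    votes = np.bincount(logits.argmax(axis=1), minlength=logits.shape[1])
    return int(votes.argmax())

def select_confident(x, label, denoise, classify, sigma, n, threshold, rng):
    # Hypothetical selection rule: keep denoised copies whose predicted
    # probability for the assigned label is at least `threshold`; the rest
    # are treated as likely hallucinations and dropped.
    denoised = noise_and_denoise(x, denoise, sigma, n, rng)
    probs = softmax(classify(denoised))
    keep = probs[:, label] >= threshold
    return denoised[keep], keep

# Toy stand-ins (assumptions): an identity "denoiser" and a linear 2-class "classifier".
denoise = lambda batch: batch
classify = lambda batch: 5.0 * batch

rng = np.random.default_rng(0)
x = np.array([2.0, 0.0])  # a toy "image" that clearly belongs to class 0
pred = smoothed_predict(x, denoise, classify, sigma=0.1, n=100, rng=rng)
kept, mask = select_confident(x, 0, denoise, classify, 0.1, 100, 0.5, rng)
print(pred, int(mask.sum()))
```

In a real setting, `denoise` would be a pre-trained denoiser and `classify` the off-the-shelf classifier being fine-tuned; only the selected copies would then contribute to the fine-tuning loss.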

Carefully Blending Adversarial Training and Purification Improves Adversarial Robustness

Boosting Adversarial Training in Safety-Critical Systems Through Boundary Data Selection

NCIS: Neural Contextual Iterative Smoothing for Purifying Adversarial Perturbations

Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing

Edge Enhancement Improves Adversarial Robustness in Image Classification

Towards Improving Robustness Against Common Corruptions in Object Detectors Using Adversarial Contrastive Learning

Towards Robustness against Unsuspicious Adversarial Examples

Towards Robustifying Image Classifiers against the Perils of Adversarial Attacks on Artificial Intelligence Systems

Robust width: A lightweight and certifiable adversarial defense

Improved Adversarial Training Through Adaptive Instance-wise Loss Smoothing

Improving Adversarial Robustness via Attention and Adversarial Logit Pairing

Adversarial Training on Purification (AToP): Advancing Both Robustness and Generalization

Boosting adversarial robustness via feature refinement, suppression, and alignment

Stratified Adversarial Robustness with Rejection

Robustness through Cognitive Dissociation Mitigation in Contrastive Adversarial Training

Confidence-aware Denoised Fine-tuning of Off-the-shelf Models for Certified Robustness

Towards Bridging the gap between Empirical and Certified Robustness against Adversarial Examples

Feature Denoising for Improving Adversarial Robustness

Adversarial Visual Robustness by Causal Intervention

On the Robustness of Adversarial Training Against Uncertainty Attacks

Boosting Adversarial Robustness Via Self-Paced Adversarial Training