Abstract: Despite the significant advances that deep neural networks (DNNs) have achieved in various visual tasks, they still exhibit vulnerability to adversarial examples, leading to serious security concerns. Recent adversarial training techniques have utilized inverse adversarial attacks to generate high-confidence examples, aiming to align the distributions of adversarial examples with the high-confidence regions of their corresponding classes. However, in this paper, our investigation reveals that high-confidence outputs under inverse adversarial attacks are correlated with biased feature activation. Specifically, training with inverse adversarial examples causes the model's attention to shift towards background features, introducing a spurious correlation bias. To address this bias, we propose Debiased High-Confidence Adversarial Training (DHAT), a novel approach that not only aligns the logits of adversarial examples with debiased high-confidence logits obtained from inverse adversarial examples, but also restores the model's attention to its normal state by enhancing foreground logit orthogonality. Extensive experiments demonstrate that DHAT achieves state-of-the-art performance and exhibits robust generalization capabilities across various vision datasets. Additionally, DHAT can be seamlessly integrated with existing advanced adversarial training techniques to further improve their performance.

What problem does this paper attempt to address?

This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial examples, and in particular the feature activation bias introduced when inverse adversarial examples are used for training. Existing adversarial training techniques generate high-confidence inverse adversarial examples and use them to align the distribution of adversarial examples with the high-confidence regions of their corresponding classes. The paper finds, however, that these high-confidence outputs are correlated with biased activation of background features: the model's attention shifts from foreground features to background features, which introduces a spurious correlation bias. This bias harms both the generalization ability and the robustness of the model. To address it, the paper proposes Debiased High-Confidence Adversarial Training (DHAT), which mitigates the spurious correlation bias through two key techniques:

1. **Debiased High-Confidence Logit Regularization (DHLR)**:
   - Quantify how strongly the model's background features are activated in a biased way under inverse adversarial attacks.
   - Calibrate the biased high-confidence logits by subtracting the logits of the background features.
   - Introduce a regularization term that aligns the logits of adversarial examples with the debiased high-confidence logits, thereby reducing the biased activation of background features and improving adversarial robustness.
2. **Foreground Logit Orthogonal Enhancement (FLOE)**:
   - Strengthen the model's focus on foreground features by minimizing the correlation between the high-confidence logits and the background feature logits.
   - Further mitigate the feature activation bias under inverse adversarial attacks by reducing the projection of the high-confidence logits onto the background feature logits in the affine space.

Through these techniques, DHAT improves the adversarial robustness and generalization ability of the model across various vision datasets, and it can be seamlessly integrated into existing advanced adversarial training methods to further improve their performance. Experimental results show that DHAT achieves state-of-the-art performance on multiple datasets and significantly reduces the robust generalization gap.
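The arithmetic behind DHLR and FLOE can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-squared-error distance, the squared-projection penalty, and the sample logits are all assumptions chosen for illustration, and how the background feature logits are obtained is left outside the sketch. Plain Python lists are used so the example has no dependencies.

```python
# Illustrative sketch only; every name and formula choice here is an assumption,
# not the method as published.

def debias(high_conf_logits, background_logits):
    """DHLR-style calibration: subtract the background feature logits
    from the biased high-confidence logits, class by class."""
    return [h - b for h, b in zip(high_conf_logits, background_logits)]


def alignment_penalty(adv_logits, target_logits):
    """Regularizer pulling adversarial logits toward the debiased target.
    Mean squared error is an assumed choice of distance."""
    n = len(adv_logits)
    return sum((a - t) ** 2 for a, t in zip(adv_logits, target_logits)) / n


def projection_penalty(high_conf_logits, background_logits, eps=1e-12):
    """FLOE-style term: squared length of the projection of the
    high-confidence logits onto the background feature logits.
    It is zero when the two logit vectors are orthogonal."""
    dot = sum(h * b for h, b in zip(high_conf_logits, background_logits))
    norm_sq = sum(b * b for b in background_logits)
    return dot * dot / (norm_sq + eps)


# Hypothetical logits for a 3-class problem.
high_conf = [6.0, 1.0, 0.5]    # from an inverse adversarial example
background = [2.0, 0.5, 0.0]   # from background features only
adversarial = [3.0, 1.5, 1.0]  # from the adversarial example

debiased = debias(high_conf, background)
print(debiased)                                  # [4.0, 0.5, 0.5]
print(alignment_penalty(adversarial, debiased))  # 0.75
print(projection_penalty(high_conf, background))
```

In a training loop, terms like these would be added to the usual adversarial training loss with some weighting, which is a further assumption here. Driving the projection term toward zero corresponds to making the high-confidence logits orthogonal to the background feature logits.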

Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment
