Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment

Kejia Zhang, Juanjuan Weng, Zhiming Luo, Shaozi Li
2024-08-12
Abstract: Despite the significant advances that deep neural networks (DNNs) have achieved in various visual tasks, they still exhibit vulnerability to adversarial examples, leading to serious security concerns. Recent adversarial training techniques have utilized inverse adversarial attacks to generate high-confidence examples, aiming to align the distributions of adversarial examples with the high-confidence regions of their corresponding classes. However, our investigation reveals that high-confidence outputs under inverse adversarial attacks are correlated with biased feature activation. Specifically, training with inverse adversarial examples causes the model's attention to shift towards background features, introducing a spurious correlation bias. To address this bias, we propose Debiased High-Confidence Adversarial Training (DHAT), a novel approach that not only aligns the logits of adversarial examples with debiased high-confidence logits obtained from inverse adversarial examples, but also restores the model's attention to its normal state by enhancing foreground logit orthogonality. Extensive experiments demonstrate that DHAT achieves state-of-the-art performance and exhibits robust generalization capabilities across various vision datasets. Additionally, DHAT can be seamlessly integrated with existing advanced adversarial training techniques to improve their performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial examples, and in particular the feature activation bias introduced when training with inverse adversarial examples. Existing adversarial training techniques generate high-confidence inverse adversarial examples and align the distribution of adversarial examples with the high-confidence regions of their corresponding classes. The paper finds that these high-confidence outputs are correlated with biased activation of background features: the model's attention shifts from foreground features to background features, which introduces a spurious correlation bias. This bias harms both the generalization ability and the robustness of the model. To address it, the paper proposes Debiased High-Confidence Adversarial Training (DHAT), which mitigates the spurious correlation bias through two key techniques:

1. **Debiased High-Confidence Logit Regularization (DHLR)**:
   - Quantify the degree to which the model's background features are activated in a biased way under inverse adversarial attacks.
   - Calibrate the biased high-confidence logits by subtracting the logits of the background features.
   - Introduce a regularization term that aligns the logits of adversarial examples with the debiased high-confidence logits, thereby reducing the biased activation of background features and improving adversarial robustness.
2. **Foreground Logit Orthogonal Enhancement (FLOE)**:
   - Strengthen the model's focus on foreground features by minimizing the correlation between high-confidence logits and background feature logits.
   - Further mitigate the feature activation bias under inverse adversarial attacks by reducing the projection of high-confidence logits onto background feature logits in the affine space.

Through these techniques, DHAT improves the adversarial robustness and generalization ability of the model on various vision datasets, and it can be seamlessly integrated into existing advanced adversarial training methods to further improve performance. Experimental results show that DHAT achieves state-of-the-art performance on multiple datasets and significantly reduces the robust generalization gap.
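
The two regularizers summarized above (DHLR and FLOE) can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. The function names, the squared-error alignment metric, the use of the projection length as the orthogonality penalty, and the weights `lam_align` and `lam_orth` are all assumptions made for illustration. The background feature logits are assumed to be given, since the summary does not specify how they are extracted.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def debias_logits(hc_logits: Vector, bg_logits: Vector) -> List[float]:
    """Calibrate biased high-confidence logits by subtracting background logits."""
    return [h - b for h, b in zip(hc_logits, bg_logits)]


def alignment_loss(adv_logits: Vector, target_logits: Vector) -> float:
    """Mean squared distance between two logit vectors (assumed metric)."""
    total = sum((a - t) ** 2 for a, t in zip(adv_logits, target_logits))
    return total / len(adv_logits)


def projection_norm(hc_logits: Vector, bg_logits: Vector) -> float:
    """Length of the projection of hc_logits onto bg_logits.

    It is zero when the two vectors are orthogonal, so minimizing it
    pushes the high-confidence logits away from the background direction.
    """
    bg_norm = math.sqrt(sum(b * b for b in bg_logits))
    if bg_norm == 0.0:
        return 0.0  # no background direction to project onto
    dot = sum(h * b for h, b in zip(hc_logits, bg_logits))
    return abs(dot) / bg_norm


def dhat_regularizer(
    adv_logits: Vector,
    hc_logits: Vector,
    bg_logits: Vector,
    lam_align: float = 1.0,  # hypothetical weight for the DHLR-style term
    lam_orth: float = 1.0,   # hypothetical weight for the FLOE-style term
) -> float:
    """Combine the alignment term and the orthogonality term."""
    target = debias_logits(hc_logits, bg_logits)
    align = alignment_loss(adv_logits, target)
    orth = projection_norm(hc_logits, bg_logits)
    return lam_align * align + lam_orth * orth


if __name__ == "__main__":
    hc = [4.0, 1.0, 0.5]   # logits of an inverse adversarial example
    bg = [1.0, 0.5, 0.5]   # logits attributed to background features
    adv = [3.0, 0.5, 0.0]  # logits of the adversarial example
    print(debias_logits(hc, bg))
    print(dhat_regularizer(adv, hc, bg))
```

In this toy example the adversarial logits already match the debiased target, so only the orthogonality term contributes to the regularizer. In an actual training loop these terms would be added to a standard adversarial training loss and computed on batched tensors in a deep learning framework.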