Abstract: Adversarial training is a widely applied approach to training deep neural networks to be robust against adversarial perturbations. However, despite its empirical success, it remains unclear why adversarial examples exist and how adversarial training methods improve model robustness. In this paper, we provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory. Specifically, we focus on a multi-class classification setting in which the structured data are composed of two types of features: robust features, which are resistant to perturbation but sparse, and non-robust features, which are susceptible to perturbation but dense. We train a two-layer smoothed ReLU convolutional neural network to learn our structured data. First, we prove that under standard training (gradient descent over the empirical risk), the network primarily learns the non-robust features rather than the robust features, which leads to adversarial examples generated by perturbations aligned with the negative non-robust feature directions. Then, we consider the gradient-based adversarial training algorithm, which runs gradient ascent to find adversarial examples and runs gradient descent over the empirical risk at those adversarial examples to update the model. We show that adversarial training provably strengthens robust feature learning and suppresses non-robust feature learning, thereby improving network robustness. Finally, we empirically validate our theoretical findings with experiments on real-image datasets, including MNIST, CIFAR10 and SVHN.
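The gradient-based adversarial training loop described in the abstract (gradient ascent on the input to find adversarial examples, then gradient descent on the empirical risk at those examples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's setup: it uses a binary linear logistic model instead of the two-layer smoothed ReLU convolutional network, signed ascent steps projected onto an l_inf ball, and hypothetical names and hyperparameters (`find_adversarial`, `adversarial_train`, `eps`, `step`, `lr`).

```python
import numpy as np

def logistic_loss_grads(w, x, y):
    """Mean logistic loss and its gradients w.r.t. w and x; labels y in {-1, +1}."""
    margin = y * (x @ w)                          # shape (n,)
    loss = np.mean(np.logaddexp(0.0, -margin))    # stable log(1 + exp(-margin))
    # d loss_i / d (x_i @ w) = -y_i * sigmoid(-margin_i), written stably via tanh
    coeff = -y * 0.5 * (1.0 - np.tanh(margin / 2.0))
    grad_w = (coeff[:, None] * x).mean(axis=0)
    grad_x = coeff[:, None] * w[None, :]          # per-example input gradient
    return loss, grad_w, grad_x

def find_adversarial(w, x, y, eps=0.1, step=0.05, n_steps=5):
    """Inner step: gradient ascent on the loss w.r.t. the input.

    Assumption: signed steps with projection onto an l_inf ball of radius eps.
    """
    x_adv = x.copy()
    for _ in range(n_steps):
        _, _, grad_x = logistic_loss_grads(w, x_adv, y)
        x_adv = x_adv + step * np.sign(grad_x)    # move to increase the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within eps of the clean input
    return x_adv

def adversarial_train(x, y, eps=0.1, lr=0.5, epochs=200, seed=0):
    """Outer step: gradient descent on the empirical risk at adversarial examples."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.standard_normal(x.shape[1])
    for _ in range(epochs):
        x_adv = find_adversarial(w, x, y, eps=eps)
        _, grad_w, _ = logistic_loss_grads(w, x_adv, y)
        w = w - lr * grad_w
    return w
```

Setting `eps=0` reduces the loop to standard training, which gives a simple baseline for comparing the learned weights on toy data.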

Understanding Robust Overfitting of Adversarial Training and Beyond

GAAT: Group Adaptive Adversarial Training to Improve the Trade-Off Between Robustness and Accuracy

Understanding Robust Overfitting from the Feature Generalization Perspective

Balance, Imbalance, and Rebalance: Understanding Robust Overfitting from a Minimax Game Perspective

Overfitting in adversarially robust deep learning

Feature Augmentation for Adversarial Robustness

Attacks Which Do Not Kill Training Make Adversarial Learning Stronger

Robust Weight Perturbation for Adversarial Training

Towards Understanding Clean Generalization and Robust Overfitting in Adversarial Training

Adversarial Robustness under Long-Tailed Distribution

Understanding and Mitigating Robust Overfitting through the Lens of Feature Dynamics

Strength-Adaptive Adversarial Training

The Surprising Harmfulness of Benign Overfitting for Adversarial Robustness

Adversarial Distributional Training for Robust Deep Learning

Improving Adversarial Robustness Requires Revisiting Misclassified Examples

Rethinking Robust Contrastive Learning from the Adversarial Perspective

Towards A Unified Min-Max Framework for Adversarial Exploration and Robustness

Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data

CAT: Customized Adversarial Training for Improved Robustness

Adversarial Masking: Towards Understanding Robustness Trade-off for Generalization

Dynamic Label Adversarial Training for Deep Learning Robustness Against Adversarial Attacks