Abstract: Adversarial training is a widely applied approach to training deep neural networks to be robust against adversarial perturbations. Although adversarial training has achieved empirical success, it remains unclear why adversarial examples exist and how adversarial training methods improve model robustness. In this paper, we provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory. Specifically, we focus on a multi-class classification setting in which the structured data are composed of two types of features: robust features, which are resistant to perturbation but sparse, and non-robust features, which are susceptible to perturbation but dense. We train a two-layer smoothed ReLU convolutional neural network to learn our structured data. First, we prove that under standard training (gradient descent over the empirical risk), the network primarily learns the non-robust features rather than the robust features, which gives rise to adversarial examples generated by perturbations aligned with the negative non-robust feature directions. Then, we consider the gradient-based adversarial training algorithm, which runs gradient ascent to find adversarial examples and runs gradient descent over the empirical risk at those adversarial examples to update the model. We show that adversarial training provably strengthens robust feature learning and suppresses non-robust feature learning, thereby improving network robustness. Finally, we empirically validate our theoretical findings with experiments on real-image datasets, including MNIST, CIFAR10 and SVHN.
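The gradient-based adversarial training loop described in the abstract (gradient ascent on the input to find an adversarial example, then gradient descent on the risk at that example) can be sketched as follows. This is a minimal illustration under loud assumptions, not the paper's setup: it uses a binary linear logistic model in place of the two-layer smoothed ReLU CNN, labels in {-1, +1} in place of the multi-class setting, and a signed, projected ascent step on an l_inf ball (a PGD-style choice). The names `find_adversarial`, `adversarial_train`, `eps`, `step` and `n_steps` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def loss(w, x, y):
    # Logistic loss log(1 + exp(-y * <w, x>)) for a label y in {-1, +1}.
    m = y * dot(w, x)
    return math.log1p(math.exp(-m)) if m >= 0 else -m + math.log1p(math.exp(m))

def sign(v):
    return (v > 0) - (v < 0)

def find_adversarial(w, x, y, eps, step, n_steps):
    # Inner loop (assumed PGD-style): signed gradient ascent on the loss
    # with respect to the input, projected back onto the l_inf ball of
    # radius eps around the clean input x.
    x_adv = list(x)
    for _ in range(n_steps):
        # d loss / d x = -y * sigmoid(-y <w, x>) * w
        coeff = -y * sigmoid(-y * dot(w, x_adv))
        x_adv = [xa + step * sign(coeff * wi) for xa, wi in zip(x_adv, w)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def adversarial_train(data, dim, eps, lr=0.1, epochs=50, step=0.05, n_steps=5):
    # Outer loop: full-batch gradient descent on the empirical risk,
    # evaluated at the adversarial examples rather than the clean inputs.
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in data:
            x_adv = find_adversarial(w, x, y, eps, step, n_steps)
            # d loss / d w = -y * sigmoid(-y <w, x_adv>) * x_adv
            coeff = -y * sigmoid(-y * dot(w, x_adv))
            grad = [g + coeff * xa / len(data) for g, xa in zip(grad, x_adv)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

Setting `eps=0` reduces the loop to standard training (plain gradient descent over the empirical risk), which gives a simple baseline for comparing which input coordinates each procedure ends up weighting.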

Towards Robust Training of Neural Networks by Regularizing Adversarial Gradients

Towards Robust DNNs: an Taylor Expansion-Based Method for Generating Powerful Adversarial Examples

Learning More Robust Features with Adversarial Training

Feature Augmentation for Adversarial Robustness

Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing Their Input Gradients

Enhancing Adversarial Robustness in SNNs with Sparse Gradients

Adaptive Retraining for Neural Network Robustness in Classification

Training Robust Deep Neural Networks via Adversarial Noise Propagation

Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data

Interpreting and Improving Adversarial Robustness of Deep Neural Networks With Neuron Sensitivity

Robust Sparse Regularization: Simultaneously Optimizing Neural Network Robustness and Compactness

A survey of robust adversarial training in pattern recognition: Fundamental, theory, and methodologies

A Direct Approach to Robust Deep Learning Using Adversarial Networks

Improving Adversarial Robustness Requires Revisiting Misclassified Examples

Robustra: Training Provable Robust Neural Networks over Reference Adversarial Space

Over-parameterization and Adversarial Robustness in Neural Networks: An Overview and Empirical Analysis

DeepDefense: Training Deep Neural Networks with Improved Robustness

Globally-Robust Neural Networks

Towards Deep Learning Models Resistant to Adversarial Attacks

An efficient adversarial example generation algorithm based on an accelerated gradient iterative fast gradient

Regularization for Adversarial Robust Learning