Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data

Binghui Li, Yuanzhi Li

2024-10-11

Abstract: Adversarial training is a widely applied approach to training deep neural networks to be robust against adversarial perturbation. However, although adversarial training has achieved empirical success in practice, it remains unclear why adversarial examples exist and how adversarial training methods improve model robustness. In this paper, we provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory. Specifically, we focus on a multiple classification setting, where the structured data can be composed of two types of features: the robust features, which are resistant to perturbation but sparse, and the non-robust features, which are susceptible to perturbation but dense. We train a two-layer smoothed ReLU convolutional neural network to learn our structured data. First, we prove that by using standard training (gradient descent over the empirical risk), the network learner primarily learns the non-robust feature rather than the robust feature, which thereby leads to the adversarial examples that are generated by perturbations aligned with negative non-robust feature directions. Then, we consider the gradient-based adversarial training algorithm, which runs gradient ascent to find adversarial examples and runs gradient descent over the empirical risk at adversarial examples to update models. We show that the adversarial training method can provably strengthen the robust feature learning and suppress the non-robust feature learning to improve the network robustness. Finally, we also empirically validate our theoretical findings with experiments on real-image datasets, including MNIST, CIFAR10 and SVHN.
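
The gradient-based adversarial training algorithm described in the abstract can be written as a min-max problem. The following is a sketch in standard notation, not the paper's exact formulation: the symbols $\theta$ (network parameters), $f_\theta$ (the network), $\ell$ (the per-example loss), $\varepsilon$ (the perturbation budget), $\eta$ (the learning rate), and the choice of norm $\|\cdot\|_p$ are all assumptions introduced here for illustration.

$$
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \; \max_{\|\delta_i\|_p \le \varepsilon} \; \ell\big(f_\theta(x_i + \delta_i),\, y_i\big)
$$

The inner maximization is approximated by gradient ascent on the input, which yields an adversarial example $\tilde{x}_i^{(t)} = x_i + \delta_i^{(t)}$. The outer minimization is a gradient descent step on the empirical risk evaluated at those adversarial examples:

$$
\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_\theta \, \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_{\theta^{(t)}}(\tilde{x}_i^{(t)}),\, y_i\big)
$$

Setting $\varepsilon = 0$ recovers standard training, that is, gradient descent over the empirical risk on clean data.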

Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores the theoretical foundation of adversarial training as a way to improve the robustness of neural networks. Specifically, it attempts to answer two core questions:

1. **Why does standard training lead neural networks to converge to non-robust solutions?**
   - Standard training (gradient descent over the empirical risk) tends to make the network learn non-robust features rather than robust features, which leads to the existence of adversarial examples.
2. **How does the adversarial training algorithm help optimize neural networks to enhance their adversarial robustness?**
   - Adversarial training suppresses the learning of non-robust features and strengthens the learning of robust features, thereby improving the overall robustness of the network.

The paper analyzes this process by decomposing data into two types of features: robust features (resistant to perturbation but sparse) and non-robust features (susceptible to perturbation but dense). It proves, for a two-layer smoothed ReLU convolutional network trained on this structured data, that adversarial training improves robustness against adversarial perturbations, and it supports the theory with experiments on MNIST, CIFAR10 and SVHN.
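
The contrast between standard and adversarial training can be sketched on a toy problem. The Python example below is a minimal illustration under loud assumptions, not the paper's setting: it uses a linear logistic classifier instead of a two-layer smoothed ReLU convolutional network, a hypothetical two-example dataset in which coordinate 0 plays the role of a strong "robust" feature and the remaining coordinates play the role of weak "non-robust" features, and a single signed gradient ascent step under an L-infinity budget `eps`. All function names and constants are made up for this sketch.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def margin(w, x, y):
    # y * <w, x> for a label y in {-1, +1}.
    return y * sum(wi * xi for wi, xi in zip(w, x))

def logistic_loss(w, x, y):
    # log(1 + exp(-margin)), computed stably.
    m = margin(w, x, y)
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))

def input_gradient(w, x, y):
    # Gradient of the logistic loss with respect to the input x.
    coeff = -y * sigmoid(-margin(w, x, y))
    return [coeff * wi for wi in w]

def weight_gradient(w, x, y):
    # Gradient of the logistic loss with respect to the weights w.
    coeff = -y * sigmoid(-margin(w, x, y))
    return [coeff * xi for xi in x]

def sign(v):
    return (v > 0) - (v < 0)

def perturb(w, x, y, eps):
    # One signed gradient ascent step on the input (assumed L-infinity budget eps).
    g = input_gradient(w, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def train(data, eps, lr=0.1, epochs=200):
    # Gradient descent over the empirical risk; eps > 0 evaluates the risk
    # at adversarial examples, eps == 0 is standard training.
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in data:
            x_used = perturb(w, x, y, eps) if eps > 0 else x
            g = weight_gradient(w, x_used, y)
            grad = [a + b / len(data) for a, b in zip(grad, g)]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical toy data: x = y * FEATURES. Coordinate 0 is large ("robust"),
# the other coordinates are smaller than the budget eps ("non-robust").
FEATURES = [1.0, 0.1, 0.1, 0.1, 0.1, 0.1]
DATA = [(FEATURES, 1), ([-v for v in FEATURES], -1)]

w_std = train(DATA, eps=0.0)
w_adv = train(DATA, eps=0.2)
print("standard    :", [round(v, 3) for v in w_std])
print("adversarial :", [round(v, 3) for v in w_adv])
```

In this toy setup the perturbation budget exceeds the magnitude of the weak coordinates, so the adversary can flip their sign and adversarial training keeps their weights close to zero while the weight on the strong coordinate keeps growing. This only mirrors the qualitative picture described above; it says nothing about the paper's proofs or its experiments.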
