Abstract: Adversarial examples have raised several open questions, such as why they can deceive classifiers and transfer between different models. A prevailing hypothesis to explain these phenomena suggests that adversarial perturbations appear as random noise but contain class-specific features. This hypothesis is supported by the success of perturbation learning, where classifiers trained solely on adversarial examples and the corresponding incorrect labels generalize well to correctly labeled test data. Although this hypothesis and perturbation learning are effective in explaining intriguing properties of adversarial examples, their solid theoretical foundation is limited. In this study, we theoretically explain the counterintuitive success of perturbation learning. We assume wide two-layer networks and the results hold for any data distribution. We prove that adversarial perturbations contain sufficient class-specific features for networks to generalize from them. Moreover, the predictions of classifiers trained on mislabeled adversarial examples coincide with those of classifiers trained on correctly labeled clean samples. The code is available at https://github.com/s-kumano/perturbation-learning.
### What problems does this paper attempt to solve?
This paper aims to solve the following two main problems:
1. **Why can adversarial examples deceive classifiers and transfer across models**:
- Adversarial examples are inputs modified by small, deliberately crafted perturbations that cause machine-learning models to misclassify them. Although they have attracted wide attention in practice, the mechanisms behind them are not yet fully understood.
- A popular hypothesis in the literature is the "feature hypothesis": although adversarial perturbations look like random noise, they actually contain class-specific features. This hypothesis gives a unified explanation for why adversarial examples deceive classifiers and why they transfer between different models.
2. **The success of perturbation learning and its theoretical basis**:
- Perturbation learning is a training method in which a classifier is trained only on adversarial examples paired with their incorrect labels, yet still generalizes well to correctly labeled clean test data. For example, on the CIFAR-10 dataset, a classifier trained this way achieved an accuracy of 77%.
- Although perturbation learning succeeds in experiments, its theoretical basis is still limited. Previous theoretical studies rely on strict assumptions about the data distribution, perturbation design, training process, and model architecture.
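The perturbation-learning procedure described above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's experimental setup: it assumes a linear classifier trained by gradient descent on the logistic loss, synthetic two-class Gaussian data, a one-step L2 targeted perturbation, and target labels drawn uniformly at random (so they are independent of the true labels and about half of them disagree with the true label). All names and constants (`make_data`, `train_linear`, `eps`, the mean vector `mu`) are hypothetical choices made for illustration.

```python
import numpy as np

def make_data(rng, n, mu, sigma=1.0):
    """Two-class Gaussian data: x = y * mu + noise, with y in {-1, +1}."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + sigma * rng.standard_normal((n, mu.size))
    return x, y

def train_linear(x, y, lr=0.5, steps=300):
    """Full-batch gradient descent on the logistic loss for f(x) = w . x."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        margin = y * (x @ w)
        # sigmoid(-margin), written with tanh for numerical stability
        s = 0.5 * (1.0 - np.tanh(0.5 * margin))
        # descent step: the loss gradient is -mean(s * y * x)
        w += lr * (x * (y * s)[:, None]).mean(axis=0)
    return w

def accuracy(w, x, y):
    return float(np.mean(np.sign(x @ w) == y))

rng = np.random.default_rng(0)
d = 10
mu = np.zeros(d)
mu[0] = 2.0  # assumed class-mean direction for the toy data

x_train, y_train = make_data(rng, 2000, mu)
x_test, y_test = make_data(rng, 2000, mu)

# Step 1: train a classifier on clean, correctly labeled data.
w_clean = train_linear(x_train, y_train)

# Step 2: perturb every training point toward a random target label t.
# For a linear model, the input gradient of t * f(x) is t * w.
t = rng.choice([-1.0, 1.0], size=len(x_train))
eps = 1.0  # assumed L2 perturbation size
x_adv = x_train + eps * t[:, None] * (w_clean / np.linalg.norm(w_clean))

# Step 3: train a fresh classifier on (perturbed input, target label) only.
w_pert = train_linear(x_adv, t)

# Step 4: evaluate both classifiers on clean, correctly labeled test data.
acc_clean = accuracy(w_clean, x_test, y_test)
acc_pert = accuracy(w_pert, x_test, y_test)
label_agreement = float(np.mean(t == y_train))

print(f"clean-trained accuracy:        {acc_clean:.3f}")
print(f"perturbation-trained accuracy: {acc_pert:.3f}")
print(f"targets equal to true labels:  {label_agreement:.3f}")
```

In this toy setting the second classifier never sees a label that is informative about the clean input, so the only signal it can learn from is the perturbation itself. That perturbation points along the first classifier's weight vector, which is why the learned direction carries over to clean data.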
### Main contributions of the paper
To address the problems above, the authors make the following contributions:
1. **Theoretically support the feature hypothesis**:
- The authors prove that each adversarial perturbation is parallel to a weighted sum of all training samples, which indicates that a single perturbation may carry information about the entire training set.
- In some cases (for example, when the training samples are mutually orthogonal), the perturbation fully encodes all training data and label information.
2. **Theoretical explanation of perturbation learning**:
- The authors prove that, under three mild conditions, the predictions of a classifier trained on adversarial perturbations coincide with those of a classifier trained on correctly labeled clean samples.
- These conditions explain the success of perturbation learning from geometric and quantitative perspectives.
3. **Relax the assumptions of previous work**:
- The authors assume only that the two-layer neural network is sufficiently wide. They make no strict assumptions about other aspects (such as the data distribution or the activation function), which makes the results more general.
### Formula summary
- The direction of the adversarial perturbation \( r_n \) is expressed as:
\[
r_n \parallel \frac{1}{N} \sum_{k = 1}^{N} y_k\Phi(x_n,x_k)x_k\int_{0}^{T_f}\ell'(-y_kf(x_k;t))dt+\xi_n
\]
where \( \xi_n \) satisfies \( \|\xi_n\|=\tilde{O}(1) \).
- For the identity loss function \( \ell(s)=s \), the derivative is \( \ell'(s)=1 \), so the time integral equals \( T_f \) and the perturbation direction simplifies to:
\[
r_n \parallel T_f\frac{1}{N} \sum_{k = 1}^{N} y_k\Phi(x_n,x_k)x_k+\xi_n
\]
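
The identity-loss formula can be checked numerically in the orthogonal case. The sketch below is illustrative only and rests on several assumptions: `phi` is a hypothetical positive stand-in for \( \Phi \) (its actual definition is not given in this summary), the residual \( \xi_n \) is dropped, and \( T_f \) defaults to 1. It builds mutually orthogonal training samples, forms the weighted sum, and reads each label back by projecting the perturbation onto the corresponding sample.

```python
import numpy as np

def phi(x, z):
    """Hypothetical positive stand-in for Phi(x, z); not the paper's definition."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * x.size))

def perturbation_direction(x_n, xs, ys, t_f=1.0):
    """T_f * (1/N) * sum_k y_k * Phi(x_n, x_k) * x_k, with the residual xi_n omitted."""
    weights = np.array([y * phi(x_n, x) for x, y in zip(xs, ys)])
    return t_f * (weights[:, None] * xs).mean(axis=0)

rng = np.random.default_rng(0)
n, d = 5, 8

# Mutually orthogonal training samples: orthonormal columns from a reduced QR
# factorization, transposed into rows and rescaled to different norms.
q, _ = np.linalg.qr(rng.standard_normal((d, n)))
xs = q.T * rng.uniform(1.0, 2.0, size=(n, 1))
ys = rng.choice([-1.0, 1.0], size=n)

r = perturbation_direction(xs[0], xs, ys)

# With orthogonal samples, <x_j, r> = (T_f / N) * y_j * Phi(x_n, x_j) * ||x_j||^2,
# so the sign of each projection equals the label y_j whenever Phi is positive.
recovered = np.sign(xs @ r)
print("true labels:     ", ys)
print("recovered labels:", recovered)
```

Changing `t_f` rescales `r` without rotating it, which is consistent with the formula specifying only the direction of \( r_n \).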
Through these theoretical analyses, the paper provides a more solid theoretical basis for adversarial examples and perturbation learning and explains the reasons for their success.