Improving Adversarial Transferability via Model Alignment

Avery Ma, Amir-massoud Farahmand, Yangchen Pan, Philip Torr, Jindong Gu
2024-07-17
Abstract: Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability in generating transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence in the predictions between the source model and another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric analysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to improve the transferability of adversarial perturbations. Specifically, the authors propose a model alignment technique that improves a given source model's ability to generate transferable adversarial perturbations.

### Problem Background

Neural networks are vulnerable to adversarial perturbations, and these perturbations transfer across models: a perturbation crafted to attack one model can also fool other models. Adversarial transferability has drawn wide attention because it poses a security challenge for deployed machine-learning systems.

### Paper Goals

The authors' goal is to make the source model generate more transferable adversarial perturbations through model alignment. The method fine-tunes the parameters of the source model to minimize an alignment loss, which measures the difference in predictions between the source model and another, independently trained model (called the witness model).

### Main Contributions

1. **Model Alignment Method**: Fine-tune the parameters of the source model by minimizing an alignment loss (such as the KL divergence) so that the perturbations it generates are more transferable.
2. **Geometric Analysis**: Study how the alignment process changes the loss landscape. The analysis finds that the loss surface of the aligned model is smoother, which the authors connect to the higher transferability of the generated perturbations.
3. **Experimental Verification**: Run extensive experiments on the ImageNet dataset with multiple model architectures. The results show that perturbations generated from the aligned source model transfer better than those from the original source model, and that the method is compatible with multiple attack algorithms.
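To make the idea of minimizing an alignment loss concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the alignment loss is the KL divergence taken from the witness prediction to the source prediction (the direction is an assumption), and it treats the source model's output logits as free parameters instead of fine-tuning real network weights. The names `softmax`, `kl_divergence`, `alignment_loss`, and `align_step` are hypothetical helpers introduced for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def alignment_loss(source_logits, witness_logits):
    """Assumed form: KL from the witness prediction (P) to the source prediction (Q)."""
    return kl_divergence(softmax(witness_logits), softmax(source_logits))


def align_step(source_logits, witness_logits, lr=0.5):
    """One gradient-descent step on the source logits (toy stand-in for fine-tuning).

    For Q = softmax(source_logits) and a fixed P, the gradient of
    KL(P || Q) with respect to the source logits is Q - P.
    """
    p = softmax(witness_logits)
    q = softmax(source_logits)
    return [z - lr * (qi - pi) for z, qi, pi in zip(source_logits, q, p)]


# Toy logits for a single input; the witness stays fixed throughout.
source = [2.0, 0.5, -1.0]
witness = [0.2, 1.5, 0.3]

losses = []
for _ in range(50):
    losses.append(alignment_loss(source, witness))
    source = align_step(source, witness)
final_loss = alignment_loss(source, witness)

print(f"initial loss: {losses[0]:.4f}, final loss: {final_loss:.6f}")
```

In a real setting the gradient would flow through the source network's parameters over a dataset of inputs, while the witness model stays frozen; the loop above only shows that repeated descent steps on this loss pull the source prediction toward the witness prediction.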
### Mathematical Formulas

- Alignment loss:

  \[ \ell_a(x, \theta_s, \theta_w) = d\big(z_q^s(x), z_q^w(x)\big) \]

  where \(d\) measures the difference between the outputs of the two models at the \(q\)-th layer, \(\theta_s\) and \(\theta_w\) are the parameters of the source model and the witness model respectively, and \(z_q^s(x)\) and \(z_q^w(x)\) are the outputs of the source model and the witness model at the \(q\)-th layer respectively.

- KL divergence (used to measure the difference between probability distributions):

  \[ D_{\text{KL}}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \]

### Conclusion

Aligning the source model can significantly improve the transferability of the adversarial perturbations it generates, which offers new ideas and methods for improving the security and robustness of deep-learning models.