Joint Universal Adversarial Perturbations with Interpretations

Liang-bo Ning, Zeyu Dai, Wenqi Fan, Jingran Su, Chao Pan, Luning Wang, Qing Li
2024-08-03
Abstract: Deep neural networks (DNNs) have significantly boosted the performance of many challenging tasks. Despite the great development, DNNs have also exposed their vulnerability. Recent studies have shown that adversaries can manipulate the predictions of DNNs by adding a universal adversarial perturbation (UAP) to benign samples. On the other hand, increasing efforts have been made to help users understand and explain the inner working of DNNs by highlighting the most informative parts (i.e., attribution maps) of samples with respect to their predictions. Moreover, we first empirically find that such attribution maps between benign and adversarial examples have a significant discrepancy, which has the potential to detect universal adversarial perturbations for defending against adversarial attacks. This finding motivates us to further investigate a new research problem: whether there exist universal adversarial perturbations that are able to jointly attack DNNs classifier and its interpretation with malicious desires. It is challenging to give an explicit answer since these two objectives are seemingly conflicting. In this paper, we propose a novel attacking framework to generate joint universal adversarial perturbations (JUAP), which can fool the DNNs model and misguide the inspection from interpreters simultaneously. Comprehensive experiments on various datasets demonstrate the effectiveness of the proposed method JUAP for joint attacks. To the best of our knowledge, this is the first effort to study UAP for jointly attacking both DNNs and interpretations.
Cryptography and Security, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper explores how to generate Joint Universal Adversarial Perturbations (JUAP) that simultaneously attack deep neural network (DNN) classifiers and their explanation mechanisms. Specifically, it addresses the following key problems:

1. **Vulnerability of DNNs**: Although DNNs perform well on many tasks, they are vulnerable to adversarial attacks. By adding a Universal Adversarial Perturbation (UAP) to benign samples, attackers can manipulate a DNN's predictions.
2. **Reliability of explanation mechanisms**: To make DNNs more trustworthy, researchers have developed explanation mechanisms (such as CAM, GradCAM, and RTS) that highlight the parts of an input that drive the model's decision. However, these mechanisms can also be misled by adversarial samples.
3. **Possibility of joint attacks**: The paper poses and studies a new question: does a universal adversarial perturbation exist that misleads the explanation mechanism while also attacking the DNN classifier? The question is hard because the two goals appear to conflict: changing the prediction usually requires altering the most salient parts of the image, and those are exactly the parts the explanation mechanism highlights.
4. **Limitations of defense mechanisms**: Because the attribution maps of benign and adversarial examples differ significantly, explanation mechanisms can be used to detect adversarial attacks. The paper therefore explores how to design a perturbation that deceives the classifier and the explanation mechanism at the same time, and so evades this kind of detection.

### Solutions

To solve the above problems, the paper proposes a novel attack framework, JUAP, which uses Generative Adversarial Networks (GANs) to learn universal perturbations that mislead the DNN's prediction while leaving the attribution map largely unchanged.
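The two competing goals can be written as a single objective. The formula below is only a generic sketch of such a joint loss, not the loss used in the paper, which this summary does not state. All symbols are assumptions introduced for illustration: $f$ is the classifier, $g$ is the interpreter that returns an attribution map (for example CAM or GradCAM, which in practice also depends on $f$), $\delta_\theta$ is the single perturbation produced by a generator with parameters $\theta$, $\mathcal{L}_{\mathrm{cls}}$ is small when the prediction is wrong, $\mathcal{L}_{\mathrm{int}}$ is a distance between attribution maps, $\lambda$ is a trade-off weight, and $\epsilon$ is the perturbation budget.

$$
\min_{\theta}\; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathcal{L}_{\mathrm{cls}}\big(f(x+\delta_\theta)\big) + \lambda\, \mathcal{L}_{\mathrm{int}}\big(g(x+\delta_\theta),\, g(x)\big) \Big] \quad \text{s.t.} \quad \|\delta_\theta\|_\infty \le \epsilon
$$

In this form, setting $\lambda = 0$ reduces to an ordinary UAP objective, while larger values of $\lambda$ trade attack success for attribution maps that stay closer to the benign ones. The expectation over $x$ is what makes the perturbation universal: one $\delta_\theta$ is shared by all inputs.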
Specific methods include:

- **Generating adversarial perturbations**: Generate universal perturbations that simultaneously deceive the classifier and the explanation mechanism through an iterative optimization strategy.
- **Maintaining explanation consistency**: Ensure that the generated perturbations do not significantly change the attribution map, so that inspecting the explanation does not reveal the attack.
- **Experimental verification**: Conduct experiments on multiple datasets to verify the effectiveness of the proposed method.

### Main contributions

- **Discovering the application potential of explanation mechanisms**: Show empirically that the explanation mechanisms of DNNs can be used to detect adversarial attacks, thereby improving the security of the model.
- **Exploring new attack problems**: Study, for the first time, universal adversarial perturbations that simultaneously attack the DNN classifier and the explanation mechanism.
- **Proposing a novel attack framework**: Develop the JUAP framework, which generates adversarial samples that mislead both the classifier and the explanation mechanism.
- **Experimental verification**: Demonstrate, through extensive experiments, the effectiveness and robustness of JUAP on different datasets.

In summary, this paper reveals security problems in both DNNs and their explanation mechanisms and proposes a new attack framework that can serve as a reference for future research.
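The detection idea behind the first contribution can be illustrated with a toy discrepancy measure between two attribution maps. This is a minimal sketch under stated assumptions, not the detector from the paper: the function names, the min-max normalization, the mean-absolute-difference metric, and the `0.2` threshold are all hypothetical choices made for illustration.

```python
from typing import List

# Hypothetical type alias: a 2-D attribution map stored as nested lists.
Map = List[List[float]]


def normalize(m: Map) -> Map:
    """Min-max scale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / span for v in row] for row in m]


def l1_discrepancy(a: Map, b: Map) -> float:
    """Mean absolute difference between two normalized maps of equal shape."""
    na, nb = normalize(a), normalize(b)
    total = 0.0
    count = 0
    for row_a, row_b in zip(na, nb):
        for va, vb in zip(row_a, row_b):
            total += abs(va - vb)
            count += 1
    return total / count


def looks_perturbed(reference_map: Map, test_map: Map,
                    threshold: float = 0.2) -> bool:
    """Flag the test input when its map drifts from the reference map
    by more than the (assumed, untuned) threshold."""
    return l1_discrepancy(reference_map, test_map) > threshold


if __name__ == "__main__":
    benign = [[0.0, 1.0], [0.0, 0.0]]   # attention on the top-right cell
    shifted = [[0.0, 0.0], [0.0, 1.0]]  # attention moved to the bottom-right cell
    print(l1_discrepancy(benign, shifted))
    print(looks_perturbed(benign, shifted))
```

A joint attack of the kind described above aims to keep such a discrepancy small, so a check like `looks_perturbed` would not fire even though the prediction has changed.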