Attention Masks Help Adversarial Attacks to Bypass Safety Detectors

Yunfan Shi
2024-11-07
Abstract: Despite recent research advancements in adversarial attack methods, current approaches against XAI monitors are still detectable and slow. In this paper, we present an adaptive framework for attention mask generation that enables stealthy, explainable and efficient PGD adversarial attacks on image classifiers under XAI monitors. Specifically, we use a mutation XAI mixture and a multitask self-supervised X-UNet to generate attention masks that guide the PGD attack. Experiments on MNIST (MLP) and CIFAR-10 (AlexNet) show that our system can outperform benchmark PGD, SparseFool and the SOTA SINIFGSM in balancing stealth, efficiency and explainability, which is crucial for effectively fooling classifiers protected by SOTA defenses.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses three weaknesses of current adversarial attack methods when an Explainable Artificial Intelligence (XAI) monitor is in place:

1. **Lack of stealth**: Existing adversarial attack methods (such as PGD) are easily detected by XAI monitors.
2. **Low efficiency**: These methods are slow to generate adversarial samples.
3. **Lack of interpretability**: It is difficult to explain the working principles behind the adversarial samples that existing methods generate.

To solve these problems, the author proposes an adaptive framework that uses attention mask generation to achieve stealthy, efficient and interpretable adversarial attacks. Specific techniques include:

- Using the multitask self-supervised X-UNet model to generate attention masks that guide PGD attacks.
- Combining the mutation XAI mixture algorithm with multitask self-supervised learning to improve the stealth and efficiency of attacks.
- Generating partial attention masks through methods such as Integrated Gradients and Layer-wise Relevance Propagation (LRP); these masks either guide PGD attacks directly or serve as partial labels for training X-UNet.

Experimental results on the MNIST and CIFAR-10 datasets show that this method can significantly improve the stealth, efficiency and interpretability of adversarial attacks, and that it is superior to existing methods such as the benchmark PGD, SparseFool and SINIFGSM.

### Formula summary

1. **Loss function**: \[ \text{Loss}=\lambda_1\cdot L_1(\text{advm}, \text{data})+\lambda_2\cdot L_1(\text{mask}, \text{mix})+\lambda_3\cdot\delta_{\text{acc}} \] where $\lambda_1, \lambda_2, \lambda_3$ are weight parameters, $L_1$ represents the L1 loss, and $\delta_{\text{acc}}$ represents the change in accuracy.
2. **Activation function**: \[ \text{SLU}(x, a = 0.5)=\max(0, x)+a\cdot\sin(x) \]
3. **Convolution weight initialization**: \[ \text{Normal distribution with variance}=\sigma^2 \]

Through these improvements, the author improves the performance of adversarial attacks in terms of stealth, efficiency and interpretability, and thus bypasses existing XAI monitoring systems more effectively.
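
The formulas in the summary can be sketched in a few lines of plain Python, together with one plausible way an attention mask might restrict a PGD step. This is an illustration under stated assumptions, not the paper's implementation: the function names (`slu`, `l1`, `combined_loss`, `masked_pgd_step`) are hypothetical, the L1 term is assumed to use a mean reduction, and the mask is assumed to scale the PGD update elementwise inside an L-infinity ball with pixel values in [0, 1].

```python
import math


def slu(x, a=0.5):
    """SLU(x, a) = max(0, x) + a * sin(x), as in the formula summary."""
    return max(0.0, x) + a * math.sin(x)


def l1(u, v):
    """Mean absolute difference of two equal-length sequences
    (mean reduction is an assumption, not stated in the summary)."""
    return sum(abs(p - q) for p, q in zip(u, v)) / len(u)


def combined_loss(advm, data, mask, mix, delta_acc, lams=(1.0, 1.0, 1.0)):
    """lambda_1 * L1(advm, data) + lambda_2 * L1(mask, mix) + lambda_3 * delta_acc."""
    lam1, lam2, lam3 = lams
    return lam1 * l1(advm, data) + lam2 * l1(mask, mix) + lam3 * delta_acc


def masked_pgd_step(x_adv, x_orig, grad, mask, step=0.1, eps=0.3):
    """One sign-gradient PGD step applied only where the mask is non-zero.

    Assumptions for illustration: elementwise masking, L-infinity
    projection of radius eps around x_orig, pixel range [0, 1].
    """
    out = []
    for xa, xo, g, m in zip(x_adv, x_orig, grad, mask):
        sign = (g > 0) - (g < 0)                    # sign of the loss gradient
        cand = xa + step * m * sign                 # masked ascent step
        cand = min(max(cand, xo - eps), xo + eps)   # project into the eps-ball
        out.append(min(max(cand, 0.0), 1.0))        # keep a valid pixel value
    return out
```

With a binary mask, pixels whose mask value is 0 are left untouched, so the perturbation stays inside the region the mask selects. How the paper actually combines the mask with PGD may differ from this elementwise scheme.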