A Vulnerability of Attribution Methods Using Pre-Softmax Scores

Miguel Lerma, Mirtha Lucas
2024-04-09
Abstract: We discuss a vulnerability involving a category of attribution methods used to provide explanations for the outputs of convolutional neural networks working as classifiers. It is known that this type of network is vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on effects that small modifications in the model may cause on the attribution method without altering the model outputs.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: **a vulnerability in attribution methods that explain convolutional neural network (CNN) classifiers using pre-softmax scores**.

### Specific problem description:
1. **Background of attribution methods**:
   - Attribution methods are used to explain the output of CNNs working as classifiers, helping users understand the model's decision-making process.
   - These methods generate heatmaps by calculating the influence of the input on the model output, showing which parts of the input contribute the most to that output.
2. **Known problems**:
   - CNNs are vulnerable to adversarial attacks: small perturbations of the input can change the output of the model.
   - Previous research has mainly focused on how to change the output of attribution methods by perturbing the input or the model parameters without affecting the model's predictions.
3. **New problem addressed in this paper**:
   - The authors describe a new vulnerability: by modifying the pre-softmax scores of the model, the heatmap generated by the attribution method can be changed significantly without changing the final output of the model.
   - This vulnerability makes such an attribution method unreliable, because it may mislead users into thinking that certain input regions have an important impact on the model output when those regions may in fact be irrelevant.

### Solutions and experimental verification:
- **Theoretical analysis**:
  - The authors show by mathematical derivation that adding a class-independent term \( t \) to the pre-softmax scores (the same \( t \) for every class, although it may depend on the input \( x \)) does not change the probability distribution produced by the softmax, but does affect the gradients that the attribution method computes.
  - The specific formulas are as follows:
    \[ y_c=\frac{e^{z_c}}{\sum_{i = 1}^n e^{z_i}} \]
    After adding \( t \) to every score, so that \( z'_i = z_i + t \):
    \[ y'_c=\frac{e^{z_c + t}}{\sum_{i = 1}^n e^{z_i + t}}=\frac{e^t e^{z_c}}{e^t\sum_{i = 1}^n e^{z_i}}=y_c \]
    However, the gradient of the pre-softmax score changes whenever \( t \) depends on the input \( x \):
    \[ \frac{\partial z'_i}{\partial x}=\frac{\partial(z_i + t)}{\partial x}=\frac{\partial z_i}{\partial x}+\frac{\partial t}{\partial x} \]
- **Experimental verification**:
  - The authors show how to exploit this vulnerability to change the heatmap generated by the attribution method, by modifying the activation values of the last pooling layer of the VGG19 network.
  - The experimental results show that heatmaps generated by attribution methods that use pre-softmax scores (such as Grad-CAM) are significantly distorted, while those generated by attribution methods that use post-softmax scores remain unchanged.

### Conclusions:
- This paper reveals a potential security vulnerability in attribution methods based on pre-softmax scores. The vulnerability can be exploited by modifying certain parameters inside the model without changing the model's final output.
- Attribution methods that use post-softmax scores are more robust against such attacks.
- Future research can explore whether this problem applies to a wider range of attribution methods and propose corresponding defense measures.
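
The softmax-invariance argument from the theoretical analysis can be checked numerically. The sketch below is not the authors' VGG19 experiment: it assumes a hypothetical toy "network" with linear pre-softmax scores (weights `W`) and a hypothetical class-independent shift `t(x) = U . x`, and it estimates gradients with central finite differences instead of backpropagation. All names (`W`, `U`, `scores`, `shifted_scores`, `grad`) are illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Hypothetical toy model: pre-softmax scores z_i(x) = W[i] . x
W = [[1.0, -2.0], [0.5, 3.0], [-1.0, 0.25]]
# Hypothetical shift t(x) = U . x, identical for every class
U = [4.0, -3.0]

def scores(x):
    return [dot(w, x) for w in W]

def shifted_scores(x):
    t = dot(U, x)  # depends on the input x, not on the class index
    return [z + t for z in scores(x)]

def grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = []
    for k in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[k] += h
        xm[k] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

x0 = [0.3, -0.7]  # arbitrary input point
c = 1             # class whose score is being explained

# Model outputs: identical with and without the shift
probs = softmax(scores(x0))
probs_shifted = softmax(shifted_scores(x0))

# Gradients of the pre-softmax score z_c: differ by dt/dx = U
pre_grad = grad(lambda x: scores(x)[c], x0)
pre_grad_shifted = grad(lambda x: shifted_scores(x)[c], x0)

# Gradients of the post-softmax score y_c: unaffected by the shift
post_grad = grad(lambda x: softmax(scores(x))[c], x0)
post_grad_shifted = grad(lambda x: softmax(shifted_scores(x))[c], x0)

print("probabilities equal:", all(abs(a - b) < 1e-9 for a, b in zip(probs, probs_shifted)))
print("pre-softmax gradient, original:", pre_grad)
print("pre-softmax gradient, shifted: ", pre_grad_shifted)
print("post-softmax gradient, original:", post_grad)
print("post-softmax gradient, shifted: ", post_grad_shifted)
```

In this toy setting the shifted pre-softmax gradient equals the original one plus `U`, so any heatmap built from it would change, while the predicted probabilities and the post-softmax gradient stay the same up to numerical error.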