What problem does this paper attempt to address?

This paper addresses **a vulnerability in attribution methods for convolutional neural networks (CNNs) that rely on pre-softmax scores**.

### Specific problem description:

1. **Background of attribution methods**:
   - Attribution methods explain the output of CNNs used as classifiers, helping to understand the model's decision-making process.
   - They compute the influence of the input on the model output and present it as a heatmap showing which parts of the input contribute most to the final output.
2. **Known problems**:
   - CNNs are vulnerable to adversarial attacks: small perturbations of the input can change the model's output.
   - Previous research has mainly focused on changing the output of attribution methods by perturbing the input or the model parameters without affecting the model's predictions.
3. **New problem addressed in this paper**:
   - The authors identify a new vulnerability: modifying the model's pre-softmax scores can significantly change the heatmap produced by an attribution method without changing the model's final output.
   - This makes such attribution methods unreliable, because they may lead users to believe that certain input regions strongly influence the model output when those regions may in fact be irrelevant.

### Solutions and experimental verification:

- **Theoretical analysis**:
  - The authors show by mathematical derivation that adding a class-independent constant \( t \) to the pre-softmax scores does not change the probability distribution after softmax, but does affect the gradients that attribution methods compute.
  - The softmax output for class \( c \) is:
    \[ y_c=\frac{e^{z_c}}{\sum_{i = 1}^n e^{z_i}} \]
    After adding the constant \( t \) to every score:
    \[ y'_c=\frac{e^{z_c + t}}{\sum_{i = 1}^n e^{z_i + t}}=\frac{e^t e^{z_c}}{e^t\sum_{i = 1}^n e^{z_i}}=y_c \]
    However, the gradient of the pre-softmax scores changes:
    \[ \frac{\partial z'_i}{\partial x}=\frac{\partial(z_i + t)}{\partial x}=\frac{\partial z_i}{\partial x}+\frac{\partial t}{\partial x} \]
    Here \( t \) is the same for every class, so it cancels in the softmax; but whenever \( t \) depends on the input \( x \), the extra term \( \frac{\partial t}{\partial x} \) is nonzero and alters the gradient.
- **Experimental verification**:
  - The authors demonstrate how to exploit this vulnerability by modifying the activation values of the last pooling layer of the VGG19 network, which changes the heatmap produced by the attribution method.
  - The experimental results show that heatmaps produced by attribution methods using pre-softmax scores (such as Grad-CAM) are significantly distorted, while those produced by attribution methods using post-softmax scores remain unchanged.

### Conclusions:

- This paper reveals a potential security vulnerability in attribution methods based on pre-softmax scores. It can be exploited by modifying certain parameters inside the model without changing the model's final output.
- Attribution methods using post-softmax scores are more robust against such attacks.
- Future research can explore whether this problem applies to a wider range of attribution methods and propose corresponding defense measures.
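The shift-invariance argument from the theoretical analysis can be checked numerically. The following is a minimal sketch, not the authors' VGG19 experiment: it assumes a hypothetical toy linear model `z = W x` and a hypothetical class-independent shift `t(x) = a · x`, with `W`, `a`, `x`, and all function names chosen purely for illustration. Gradients are estimated with central finite differences so that only the Python standard library is needed.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy model (assumption, not from the paper): 3 classes, 2 inputs.
W = [[1.0, -2.0], [0.5, 0.3], [-1.0, 1.5]]
# Hypothetical shift t(x) = a . x: identical for every class, but input-dependent.
a = [4.0, -3.0]

def logits(x, shifted=False):
    """Pre-softmax scores z_i(x), optionally with t(x) added to every class."""
    t = sum(ai * xi for ai, xi in zip(a, x)) if shifted else 0.0
    return [sum(w * xi for w, xi in zip(row, x)) + t for row in W]

def grad(f, x, eps=1e-6):
    """Central finite-difference estimate of the gradient of scalar f at x."""
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

x = [0.7, -0.2]
c = 0  # class whose score is being attributed

# Post-softmax probabilities with and without the shift.
probs = softmax(logits(x))
probs_shifted = softmax(logits(x, shifted=True))

# Gradients of the pre-softmax score z_c with respect to the input.
pre_grad = grad(lambda v: logits(v)[c], x)
pre_grad_shifted = grad(lambda v: logits(v, shifted=True)[c], x)

# Gradients of the post-softmax score y_c with respect to the input.
post_grad = grad(lambda v: softmax(logits(v))[c], x)
post_grad_shifted = grad(lambda v: softmax(logits(v, shifted=True))[c], x)

print("probabilities     :", probs, probs_shifted)
print("pre-softmax grads :", pre_grad, pre_grad_shifted)
print("post-softmax grads:", post_grad, post_grad_shifted)
```

In this toy setting the two probability vectors agree, the pre-softmax gradient moves from roughly `W[c]` to roughly `W[c] + a`, and the post-softmax gradient stays the same, which mirrors the distinction the summary draws between pre-softmax and post-softmax attribution.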

A Vulnerability of Attribution Methods Using Pre-Softmax Scores
