Abstract: Current methods for the interpretability of discriminative deep neural networks commonly rely on the model's input-gradients, i.e., the gradients of the output logits w.r.t. the inputs. The common assumption is that these input-gradients contain information regarding $p_{\theta}(y \mid x)$, the model's discriminative capabilities, thus justifying their use for interpretability. However, in this work we show that, as a consequence of the shift-invariance of softmax, these input-gradients can be arbitrarily manipulated without changing the discriminative function. This leaves an open question: if input-gradients can be arbitrary, why are they highly structured and explanatory in standard models? We investigate this by re-interpreting the logits of standard softmax-based classifiers as unnormalized log-densities of the data distribution, and show that input-gradients can be viewed as gradients of a class-conditional density model $p_{\theta}(x \mid y)$ implicit within the discriminative model. This leads us to hypothesize that the highly structured and explanatory nature of input-gradients may be due to the alignment of this class-conditional model $p_{\theta}(x \mid y)$ with the ground-truth data distribution $p_{\text{data}}(x \mid y)$. We test this hypothesis by studying the effect of density alignment on gradient explanations. To achieve this alignment we use score-matching, and we propose novel approximations to this algorithm that enable training large-scale models. Our experiments show that improving the alignment of the implicit density model with the data distribution enhances gradient structure and explanatory power, while reducing this alignment has the opposite effect. Overall, our finding that input-gradients capture information regarding an implicit generative model implies that we need to re-think their use for interpreting discriminative models.
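The shift-invariance argument can be illustrated with a small numerical sketch. It is not the paper's code: the linear classifier, the helper names (`logits`, `shifted_logits`, `input_gradient`), and the choice of shift $g(x) = \sin(v^\top x)$ are all assumptions made for illustration. The idea is that adding the same scalar function $g(x)$ to every logit leaves the softmax output, and hence $p_{\theta}(y \mid x)$, unchanged, while the input-gradient of each logit changes by $\nabla_x g(x)$.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def logits(x, W, b):
    """Logits of a toy linear classifier (an assumption for illustration)."""
    return W @ x + b


def shifted_logits(x, W, b, v):
    """Same logits plus an input-dependent scalar g(x) = sin(v . x).

    The same scalar is added to every class, so softmax is unaffected.
    The form of g is an arbitrary illustrative choice.
    """
    return logits(x, W, b) + np.sin(v @ x)


def input_gradient(f, x, i, eps=1e-6):
    """Central finite-difference gradient of logit i with respect to x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step)[i] - f(x - step)[i]) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))      # 3 classes, 5 input features
b = rng.normal(size=3)
v = 10.0 * rng.normal(size=5)    # controls the shape of the shift g
x = rng.normal(size=5)


def f(x_in):
    return logits(x_in, W, b)


def f_shift(x_in):
    return shifted_logits(x_in, W, b, v)


p, p_shift = softmax(f(x)), softmax(f_shift(x))
g, g_shift = input_gradient(f, x, 0), input_gradient(f_shift, x, 0)

print("same predictions:", np.allclose(p, p_shift))
print("input-gradient change (L2 norm):", np.linalg.norm(g_shift - g))
```

In this sketch the two models produce identical class probabilities, yet their input-gradients differ by $\cos(v^\top x)\, v$, which is set entirely by the arbitrary vector `v` and not by the classifier.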

Rethinking the Principle of Gradient Smooth Methods in Model Explanation

Axiomatization of Gradient Smoothing in Neural Networks

Statistic-CAM: A Gradient-Free Visual Explanations for Deep Convolutional Network

On Gradient-like Explanation under a Black-box Setting: When Black-box Explanations Become as Good as White-box

Advancing Certified Robustness of Explanation Via Gradient Quantization

Expected Grad-CAM: Towards gradient faithfulness

Gradient based Feature Attribution in Explainable AI: A Technical Review

Global Convergence of Noisy Gradient Descent.

Interpret Gaussian Process Models by Using Integrated Gradients

Rethinking the Role of Gradient-Based Attribution Methods for Model Interpretability

The Manifold Hypothesis for Gradient-Based Explanations

Uncertainty Quantification for Gradient-based Explanations in Neural Networks

Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization

Guided AbsoluteGrad: Magnitude of Gradients Matters to Explanation's Localization and Saliency

Gradient Frequency Modulation for Visually Explaining Video Understanding Models

AdaGrad under Anisotropic Smoothness

Towards a Better Understanding of Gradient-Based Explanatory Methods in NLP.

Improved Performance of Stochastic Gradients with Gaussian Smoothing

Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics

On Fine-Grained Visual Explanation in Convolutional Neural Networks

Toward a Unified Theory of Gradient Descent under Generalized Smoothness