Abstract: Perturbation robustness evaluates the vulnerabilities of models arising from a variety of perturbations, such as data corruptions and adversarial attacks. Understanding the mechanisms of perturbation robustness is critical for global interpretability. We present a model-agnostic, global mechanistic interpretability method to interpret the perturbation robustness of image models. This research is motivated by two key observations. First, previous global interpretability works, in tandem with robustness benchmarks such as mean corruption error (mCE), are not designed to directly interpret the mechanisms of perturbation robustness within image models. Second, we notice that the spectral signal-to-noise ratios (SNR) of perturbed natural images decay exponentially with frequency. This power-law-like decay implies that low-frequency signals are generally more robust than high-frequency signals, yet high classification accuracy cannot be achieved by low-frequency signals alone. By applying Shapley value theory, our method axiomatically quantifies the predictive powers of robust features and non-robust features within an information theory framework. Our method, dubbed \textbf{I-ASIDE} (\textbf{I}mage \textbf{A}xiomatic \textbf{S}pectral \textbf{I}mportance \textbf{D}ecomposition \textbf{E}xplanation), provides a unique insight into model robustness mechanisms. We conduct extensive experiments over a variety of vision models pre-trained on ImageNet to show that \textbf{I-ASIDE} can not only \textbf{measure} the perturbation robustness but also \textbf{provide interpretations} of its mechanisms.
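The idea of attributing importance to spectral components with Shapley values can be illustrated with a minimal sketch. This is not the I-ASIDE implementation. The names `radial_band_masks`, `shapley_values`, and `energy_fraction` are hypothetical, the image is random noise, and the coalition value is simply the fraction of spectral energy retained. That value is a toy stand-in for the information-theoretic, model-based characteristic function that the abstract refers to but does not specify.

```python
from itertools import combinations
from math import factorial

import numpy as np


def radial_band_masks(size, n_bands):
    """Partition the centered 2D spectrum into n_bands concentric rings."""
    freqs = np.fft.fftshift(np.fft.fftfreq(size))
    radius = np.sqrt(freqs[:, None] ** 2 + freqs[None, :] ** 2)
    # Small epsilon so the outermost frequency falls inside the last ring.
    edges = np.linspace(0.0, radius.max() + 1e-12, n_bands + 1)
    return [(radius >= lo) & (radius < hi) for lo, hi in zip(edges[:-1], edges[1:])]


def shapley_values(n_players, value_fn):
    """Exact Shapley values by enumerating every coalition (small n only)."""
    phi = np.zeros(n_players)
    for i in range(n_players):
        others = [p for p in range(n_players) if p != i]
        for k in range(n_players):
            # Weight of a coalition of size k that does not contain player i.
            weight = factorial(k) * factorial(n_players - k - 1) / factorial(n_players)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Toy setup (assumption): a random image and four radial frequency bands.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
masks = radial_band_masks(32, n_bands=4)
power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2


def energy_fraction(coalition):
    """Toy coalition value: share of spectral energy kept by the chosen bands."""
    kept = sum(power[masks[b]].sum() for b in coalition)
    return kept / power.sum()


phi = shapley_values(4, energy_fraction)
print(phi, phi.sum())
```

Because this toy value function is additive across bands, each Shapley value reduces to that band's own energy share, and the values sum to the value of the full coalition (the efficiency axiom). With a model-based value function, where bands interact, the enumeration would stay the same but the attributions would no longer be a simple per-band quantity.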

GINT: A Generative Interpretability Method Via Perturbation in the Latent Space

Generative Counterfactuals for Neural Networks Via Attribute-Informed Perturbation

Fidelity of Interpretability Methods and Perturbation Artifacts in Neural Networks

Towards Interpreting Recurrent Neural Networks Through Probabilistic Abstraction

Perturbation on Feature Coalition: Towards Interpretable Deep Neural Networks

Interpreting the Latent Space of GANs via Correlation Analysis for Controllable Concept Manipulation

Interpretation of Neural Networks Is Fragile

Feature Perturbation Augmentation for Reliable Evaluation of Importance Estimators in Neural Networks

Interpretation of Neural Networks is Susceptible to Universal Adversarial Perturbations

Where and What? Examining Interpretable Disentangled Representations

Learning Interpretable Representations with Informative Entanglements

Interpreting Model Predictions with Constrained Perturbation and Counterfactual Instances

Generative Intervention Models for Causal Perturbation Modeling

Tensor Component Analysis for Interpreting the Latent Space of GANs

GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking

Seeing is Not Always Believing: the Space of Harmless Perturbations

Toward Transparent and Controllable Quantum Generative Models

Interpreting Global Perturbation Robustness of Image Models using Axiomatic Spectral Importance Decomposition

Generative Perturbation Analysis for Probabilistic Black-Box Anomaly Attribution

A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks

Explaining Deep Graph Networks via Input Perturbation