Abstract: Explaining black-box models is fundamental to gaining trust in them and deploying them in real applications. Because existing explanation methods have been shown to lack robustness against adversarial perturbations, there is growing interest in generating robust explanations. However, existing works resort to empirical defense strategies, and these heuristic methods fail against powerful adversaries. In this paper, motivated by the success of randomized smoothing, we certify the robustness of explanations. Specifically, we compute a tight radius within which the robustness of the explanation is certified. A key challenge is formulating the robustness of an explanation mathematically; we address it by quantizing the explanation into discrete spaces, mimicking classification in randomized smoothing. To address the high computational cost of randomized smoothing, we introduce randomized gradient smoothing. We also explore the robustness of semantic explanations by certifying the robustness of capsules. In experiments on benchmark datasets, we demonstrate the effectiveness of our method from the perspectives of both post-hoc explanation and semantic explanation. Our work is a promising step towards closing the gap between theoretical robustness bounds and empirical explanations. Our code has been released at https://github.com/NKUShaw/CertifiedExplanation.
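
To make the idea of certifying a quantized explanation concrete, the following is a minimal sketch of generic randomized smoothing applied to a discretized gradient explanation. It is not the paper's algorithm or released code. The toy model, the function names (`toy_gradient`, `quantized_explanation`, `certify`), the choice of "index of the most salient feature" as the discrete label, and the Hoeffding confidence bound are all assumptions made for illustration. The radius formula sigma * Phi^{-1}(p_lower) is the standard binary-case L2 bound from the randomized smoothing literature.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def toy_gradient(x, w):
    # Gradient of a hypothetical model f(x) = 0.5 * sum(w_i * x_i**2).
    return [wi * xi for wi, xi in zip(w, x)]


def quantized_explanation(x, w):
    # Assumed quantization: collapse the saliency map to one discrete label,
    # the index of the feature with the largest absolute gradient.
    g = toy_gradient(x, w)
    return max(range(len(g)), key=lambda i: abs(g[i]))


def sample_votes(x, w, sigma, n, rng):
    # Count discrete explanation labels under Gaussian input noise.
    votes = Counter()
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[quantized_explanation(noisy, w)] += 1
    return votes


def certify(x, w, sigma=0.25, n0=200, n=2000, alpha=0.001, seed=0):
    rng = random.Random(seed)
    # Stage 1: guess the majority label from a small, separate sample.
    top = sample_votes(x, w, sigma, n0, rng).most_common(1)[0][0]
    # Stage 2: estimate that label's probability from fresh samples.
    count = sample_votes(x, w, sigma, n, rng)[top]
    # One-sided Hoeffding lower bound holding with probability >= 1 - alpha
    # (a simple, conservative stand-in for a Clopper-Pearson bound).
    p_lower = count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # abstain: no certificate at this confidence level
    # Binary-case L2 radius: sigma * Phi^{-1}(p_lower).
    return top, sigma * NormalDist().inv_cdf(p_lower)


if __name__ == "__main__":
    label, radius = certify([3.0, 0.1, 0.1], [1.0, 1.0, 1.0])
    print("top feature:", label, "certified L2 radius:", radius)
```

With probability at least 1 - alpha over the sampling, a non-abstaining result means the smoothed top-feature label is unchanged for any input perturbation whose L2 norm is below the returned radius. Quantizing to a single label is the coarsest possible choice; a finer quantization (for example, a per-feature top-k membership bit) would need one such vote per discrete output.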

Advancing Certified Robustness of Explanation Via Gradient Quantization

Robust Explanation for Free or at the Cost of Faithfulness

Rigorous Probabilistic Guarantees for Robust Counterfactual Explanations

Evaluations and Methods for Explanation through Robustness Analysis

Towards Robust Visual Explanations for Deep Convolutional Networks with Weight-Wise Perturbations

Provable Robust Saliency-based Explanations

GSmooth: Certified Robustness against Semantic Transformations via Generalized Randomized Smoothing

On Gradient-like Explanation under a Black-box Setting: When Black-box Explanations Become as Good as White-box

Uncertainty Quantification for Gradient-based Explanations in Neural Networks

Trust Regions for Explanations via Black-Box Probabilistic Certification

Constraint-Driven Explanations for Black-Box ML Models

Robust Ranking Explanations

Provably Robust and Plausible Counterfactual Explanations for Neural Networks via Robust Optimisation

Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles

Rethinking the Principle of Gradient Smooth Methods in Model Explanation

Robust Explanations for Visual Question Answering

From Robustness to Explainability and Back Again

Towards Faithful Explanations for Text Classification with Robustness Improvement and Explanation Guided Training

Efficient Contrastive Explanations on Demand

Finding Regions of Counterfactual Explanations via Robust Optimization

Better Verified Explanations with Applications to Incorrectness and Out-of-Distribution Detection