Advancing Certified Robustness of Explanation Via Gradient Quantization

Yang Xiao, Zijie Zhang, Yuchen Fang, Da Yan, Yang Zhou, Wei-Shinn Ku, Bo Hui
DOI: https://doi.org/10.1145/3627673.3679650
2024-01-01
Abstract: Explaining black-box models is fundamental to gaining trust and deploying these models in real applications. Because existing explanation methods have been shown to lack robustness against adversarial perturbations, there has been growing interest in generating robust explanations. However, existing works rely on empirical defense strategies, and these heuristic methods fail against powerful adversaries. In this paper, motivated by the success of randomized smoothing, we certify the robustness of explanations. Specifically, we compute a tight radius within which the robustness of the explanation is certified. A key challenge is how to formulate the robustness of an explanation mathematically; to address it, we quantize the explanation into discrete spaces to mimic classification in randomized smoothing. To address the high computational cost of randomized smoothing, we introduce randomized gradient smoothing. We also explore the robustness of semantic explanations by certifying the robustness of capsules. In the experiments, we demonstrate the effectiveness of our method on benchmark datasets from the perspectives of post-hoc explanation and semantic explanation, respectively. Our work is a promising step towards closing the gap between theoretical robustness bounds and empirical explanations. Our code has been released at https://github.com/NKUShaw/CertifiedExplanation.
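To illustrate the core idea of quantizing an explanation so that randomized smoothing can certify it like a classifier, here is a minimal sketch. It assumes a standard Gaussian randomized-smoothing certificate with the L2 radius sigma * Phi^{-1}(p_lower). The toy model `toy_gradient`, the top-1 quantizer `quantize_top1`, and the one-sided Hoeffding lower bound (used instead of a Clopper-Pearson interval so that only the standard library is needed) are all hypothetical choices, not the paper's method. The sketch also does not implement the paper's randomized gradient smoothing.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def toy_gradient(x, a):
    """Gradient of a hypothetical toy model f(x) = 0.5 * sum(a_i * x_i**2)."""
    return [ai * xi for ai, xi in zip(a, x)]


def quantize_top1(grad):
    """Hypothetical quantizer: map a gradient explanation to a discrete 'class',
    namely the index of the feature with the largest absolute attribution."""
    return max(range(len(grad)), key=lambda i: abs(grad[i]))


def sample_counts(x, a, sigma, num, rng):
    """Count quantized explanations of `num` Gaussian-perturbed copies of x."""
    counts = Counter()
    for _ in range(num):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        counts[quantize_top1(toy_gradient(noisy, a))] += 1
    return counts


def certify_explanation(x, a, sigma, n0=100, n=2000, alpha=0.001, seed=0):
    """Return (label, radius) for the smoothed quantized explanation, or
    (None, 0.0) to abstain when the probability bound does not exceed 1/2."""
    rng = random.Random(seed)
    # Stage 1: choose the candidate (majority) quantized explanation.
    label = sample_counts(x, a, sigma, n0, rng).most_common(1)[0][0]
    # Stage 2: fresh samples estimate its probability without selection bias.
    hits = sample_counts(x, a, sigma, n, rng)[label]
    # One-sided Hoeffding lower confidence bound at level 1 - alpha (assumption).
    p_lower = hits / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0
    # L2 radius within which the quantized explanation is certified not to change.
    return label, sigma * NormalDist().inv_cdf(p_lower)
```

For example, `certify_explanation([1.0, 3.0, 0.5], [1.0, 1.0, 1.0], sigma=0.5)` should certify feature index 1 with a positive radius, whereas two equally important features, as in `certify_explanation([1.0, 1.0], [1.0, 1.0], sigma=0.5)`, should lead to abstention. Separating selection (n0 samples) from estimation (n samples) keeps the probability bound valid.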