Local Adversarial Attacks for Understanding Model Decisions.

Wei Shi, Wentao Zhang, Ruixuan Wang
DOI: https://doi.org/10.1145/3603273.3630044
2023-01-01
Abstract: Deep learning models have recently demonstrated outstanding performance, but their lack of interpretability makes them difficult to apply to high-risk tasks. Many post-hoc explanation methods have been proposed to explain specific model predictions, but they mostly focus on which locations influence the model's decision. In other words, they provide only the location of discriminative regions and cannot show the visual differences between the relevant categories in these regions. In this paper, a simple yet effective approach is proposed to explain a specific model prediction, using visual differences between categories to provide more intuitive explanations. The basic idea is to use adversarial attacks to transform discriminative features in local regions of an image into features of another class, thereby explaining the model's prediction. In this way, the regions of the image where the original class and the target class differ are localized, and an intuitive explanation is provided through the visual changes within those regions. Extensive evaluation demonstrates that the proposed method can accurately and adaptively locate discriminative regions in images and, through its visualization results, provide users with an intuitive explanation of the basis for the model's classification.
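The basic idea described in the abstract, a targeted adversarial attack confined to a local image region, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the attack, the loss, or how regions are located (it says they are found adaptively). The sketch assumes a toy linear softmax classifier over a flattened four-pixel "image", a hand-specified binary mask, and a signed-gradient update; the names `local_targeted_attack`, `mask`, `step`, and `n_steps` are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, b, x):
    # Toy linear classifier (an assumption; the paper targets deep models).
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def predict(W, b, x):
    z = logits(W, b, x)
    return max(range(len(z)), key=z.__getitem__)

def local_targeted_attack(W, b, x, target, mask, step=0.05, n_steps=200):
    """Push x toward class `target`, changing only pixels where mask is 1.

    Hypothetical helper: an iterative signed-gradient attack on the
    cross-entropy loss -log p[target], chosen here for illustration.
    """
    x_adv = list(x)
    for _ in range(n_steps):
        if predict(W, b, x_adv) == target:
            break
        p = softmax(logits(W, b, x_adv))
        # For a linear model: d(-log p[target]) / dx_i = sum_k (p_k - 1[k == target]) * W[k][i]
        grad = [
            sum((p[k] - (1.0 if k == target else 0.0)) * W[k][i] for k in range(len(W)))
            for i in range(len(x))
        ]
        for i, g in enumerate(grad):
            if mask[i] and g != 0.0:
                moved = x_adv[i] - step * math.copysign(1.0, g)
                x_adv[i] = min(max(moved, 0.0), 1.0)  # keep pixel values in [0, 1]
    return x_adv

# Two classes, four pixels; the mask restricts the attack to pixels 0 and 1.
W = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
b = [0.0, 0.0]
x = [0.9, 0.1, 0.5, 0.5]
mask = [1, 1, 0, 0]

x_adv = local_targeted_attack(W, b, x, target=1, mask=mask)
change = [abs(a - o) for a, o in zip(x_adv, x)]  # where, and by how much, the image changed
print(predict(W, b, x), predict(W, b, x_adv))
print(change)
```

In this toy setting, `change` plays the role of the visual difference inside the attacked region: pixels outside the mask are untouched, so any change in the prediction is attributable to the edited region alone.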