Abstract: Providing explanations for deep neural network (DNN) models is crucial for their use in security-sensitive domains. A plethora of interpretation models have been proposed to help users understand the inner workings of DNNs: how does a DNN arrive at a specific decision for a given input? The improved interpretability is believed to offer a sense of security by involving humans in the decision-making process. Yet, due to its data-driven nature, the interpretability itself is potentially susceptible to malicious manipulations, about which little is known thus far. Here we bridge this gap by conducting the first systematic study on the security of interpretable deep learning systems (IDLSes). We show that existing IDLSes are highly vulnerable to adversarial manipulations. Specifically, we present ADV2, a new class of attacks that generate adversarial inputs that not only mislead target DNNs but also deceive their coupled interpretation models. Through empirical evaluation against four major types of IDLSes on benchmark datasets and in security-critical applications (e.g., skin cancer diagnosis), we demonstrate that with ADV2 the adversary is able to arbitrarily designate an input's prediction and interpretation. Further, with both analytical and empirical evidence, we identify the prediction-interpretation gap as one root cause of this vulnerability: a DNN and its interpretation model are often misaligned, making it possible to exploit both models simultaneously. Finally, we explore potential countermeasures against ADV2, including leveraging its low transferability and incorporating it in an adversarial training framework. Our findings shed light on designing and operating IDLSes in a more secure and informative fashion, leading to several promising research directions.

Bridging Interpretability and Robustness Using LIME-Guided Model Refinement

An Extension of LIME with Improvement of Interpretability and Fidelity

BMB-LIME: LIME with modeling local nonlinearity and uncertainty in explainability

"Why Should You Trust My Explanation?" Understanding Uncertainty in LIME Explanations

GLIME: General, Stable and Local LIME Explanation

Interpretable Deep Learning Models: Enhancing Transparency and Trustworthiness in Explainable AI

Harnessing the Power of Explanations for Incremental Training: A LIME-Based Approach

The Effect of Model Size on LLM Post-hoc Explainability via LIME

Model-Agnostic Interpretability of Machine Learning

Locally Invariant Explanations: Towards Stable and Unidirectional Explanations through Local Invariant Learning

Are Your Explanations Reliable? Investigating the Stability of LIME in Explaining Text Classifiers by Marrying XAI and Adversarial Attack

Improving Network Interpretability via Explanation Consistency Evaluation

G-LIME: Statistical Learning for Local Interpretations of Deep Neural Networks Using Global Priors

Interpretability and Transparency of Machine Learning in File Fragment Analysis with Explainable Artificial Intelligence

Using Decision Tree as Local Interpretable Model in Autoencoder-based LIME

Local Interpretable Model Agnostic Shap Explanations for machine learning models

DLIME: A Deterministic Local Interpretable Model-Agnostic Explanations Approach for Computer-Aided Diagnosis Systems

An Empirical Study on the Relation between Network Interpretability and Adversarial Robustness

Designing Inherently Interpretable Machine Learning Models

KNOW How to Make Up Your Mind! Adversarially Detecting and Alleviating Inconsistencies in Natural Language Explanations

Interpretable Deep Learning under Fire