Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations

Supriya Manna, Niladri Sett
2024-10-09
Abstract: Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer's response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses weaknesses in how the faithfulness of explainers is currently evaluated in natural language processing (NLP). Specifically, existing faithfulness evaluation methods have the following problems:

1. **Linear Assumption**: Many evaluation methods assume linearity, that is, that the importance of each word is independent of the other words. This assumption often fails in practice, which makes the evaluation results inaccurate.
2. **Restrictive Assumption**: Existing methods are usually applied in specific, restrictive settings that may not reflect the true behavior of the model.
3. **Misleading Assessment**: Some evaluation metrics, such as the "Area Under the Perturbation Curve" (AUPC), can be seriously misleading because they treat the metric itself as ground truth, even though the metric may not be accurate.
4. **Adversarial Robustness Assumption**: Adversarial robustness holds that similar inputs should produce similar explanations. This may not hold in practice: when the explanation changes, the model's reasoning may genuinely have changed, rather than the explainer being unreliable.

To overcome these problems, the authors introduce a new evaluation approach: **Adversarial Sensitivity**. It measures faithfulness by examining how the explainer responds when the model is under adversarial attack. Specifically, the authors make the following contributions:

- **Introducing the Concept of Adversarial Sensitivity**: They define adversarial sensitivity and propose a necessary test for faithfulness based on this concept.
- **Constructing a Robust Experimental Framework**: They describe in detail the experimental framework used for faithfulness testing.
- **Empirical Research**: They run adversarial sensitivity tests on six state-of-the-art post hoc explainers across three text classification datasets and report how consistent the results are with the popular deletion-based tests.

With this approach, the authors aim to provide a more comprehensive and reliable way to assess the faithfulness of explainers, helping researchers and practitioners better understand and trust complex deep learning models.
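The idea of adversarial sensitivity described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in, not the authors' implementation: the toy bag-of-words model (`WEIGHTS`, `score`, `predict`), the occlusion-style explainer (`occlusion_explainer`), the misspelling-style adversarial example, and the choice of Kendall tau distance between attribution rankings (`kendall_tau_distance`) are all assumptions made for illustration. The summary above does not say which distance measures, attacks, or explainers the paper uses.

```python
from itertools import combinations

# Hypothetical toy sentiment model: a bag-of-words linear scorer.
# Words missing from the table (e.g. misspellings) contribute 0.
WEIGHTS = {"great": 2.0, "plot": 0.1, "boring": -1.5, "acting": 0.2}


def score(tokens):
    """Positive-class score: sum of the per-word weights."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def predict(tokens):
    """Return 1 (positive) if the score is above zero, else 0 (negative)."""
    return 1 if score(tokens) > 0 else 0


def occlusion_explainer(tokens):
    """Toy post hoc explainer (assumption): the importance of position i
    is the drop in score when token i is removed from the input."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def kendall_tau_distance(a, b):
    """Fraction of position pairs that the two attribution vectors order
    differently (0.0 = same ranking, 1.0 = fully reversed)."""
    pairs = list(combinations(range(len(a)), 2))
    discordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return discordant / len(pairs)


def adversarial_sensitivity(original, adversarial):
    """Distance between the explanations of an input and of its adversarial
    counterpart. Only defined here when the attack actually flips the model."""
    if len(original) != len(adversarial):
        raise ValueError("this sketch assumes token-aligned inputs")
    if predict(original) == predict(adversarial):
        raise ValueError("the adversarial input must change the prediction")
    return kendall_tau_distance(
        occlusion_explainer(original), occlusion_explainer(adversarial)
    )


# A misspelling-style perturbation: readable to a human, but the toy model
# no longer recognises the word and its prediction flips from 1 to 0.
original = ["great", "plot", "boring", "acting"]
adversarial = ["gr8", "plot", "boring", "acting"]

sensitivity = adversarial_sensitivity(original, adversarial)
print(f"prediction: {predict(original)} -> {predict(adversarial)}")
print(f"explanation distance: {sensitivity:.3f}")
```

In this sketch a larger distance means the explainer reacted more strongly to the attack. Under the reasoning summarized above, an explainer whose output barely changes while the model's prediction flips would be suspect, since the model's behaviour has evidently changed. A real evaluation would replace the toy pieces with an actual classifier, attack, and explainer.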