Abstract: Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer's response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.

What problem does this paper attempt to address?

This paper addresses shortcomings in how the faithfulness of explainers is currently evaluated in natural language processing (NLP). Specifically, existing faithfulness evaluation methods have the following problems:

1. **Linear Assumption**: Many evaluation methods rest on a linearity assumption, namely that the importance of each word is independent of the other words. This assumption often fails in practice, which leads to inaccurate results.
2. **Restrictive Assumption**: Existing methods are usually evaluated in specific, restrictive settings that may not fully reflect the true behavior of the model.
3. **Misleading Assessment**: Metrics such as "Area Under the Perturbation Curve" (AUPC) can be seriously misleading because they treat the metric itself as ground truth, when in fact the metric may not be accurate.
4. **Adversarial Robustness Assumption**: This assumption holds that similar inputs should produce similar explanations. It may not hold in practice: a changed explanation can reflect a genuine change in the model's reasoning rather than an unreliable explainer.

To overcome these problems, the authors introduce a new evaluation method: **Adversarial Sensitivity**. It measures faithfulness by evaluating how the explainer responds when the model is under adversarial attack. Specifically, the authors make the following contributions:

- **Introducing the Concept of Adversarial Sensitivity**: They define adversarial sensitivity and propose a necessary test based on this concept.
- **Constructing a Robust Experimental Framework**: They describe in detail the experimental framework used for faithfulness testing.
- **Empirical Research**: They run adversarial sensitivity tests on six state-of-the-art post-hoc explainers across three text classification datasets and report how consistent the results are with the popular deletion-based test.

With this method, the authors aim to provide a more comprehensive and reliable way to assess the faithfulness of explainers, helping researchers and practitioners better understand and trust complex deep learning models.
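The intuition behind adversarial sensitivity can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy lexicon classifier (`WEIGHTS`, `score`, `predict`), the leave-one-out `occlusion_explainer`, the deliberately unfaithful `constant_explainer`, and the choice of cosine distance as the measure of explanation change. The paper's actual attacks, explainers, and distance measures are not specified in this summary. The sketch only shows the core check: when an adversarial edit flips the model's prediction, a faithful explainer should produce a noticeably different explanation.

```python
import math
from typing import Callable, List

# Hypothetical toy lexicon; unknown words contribute 0 to the score.
WEIGHTS = {"good": 2.0, "great": 3.0, "bad": -2.0, "awful": -3.0}

def score(tokens: List[str]) -> float:
    """Toy model: sum of word weights; a score above 0 means 'positive'."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

def predict(tokens: List[str]) -> int:
    """Return 1 for positive, 0 for negative."""
    return 1 if score(tokens) > 0 else 0

def occlusion_explainer(tokens: List[str]) -> List[float]:
    """Leave-one-out attribution: change in score when token i is removed."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

def constant_explainer(tokens: List[str]) -> List[float]:
    """Deliberately unfaithful baseline: ignores the model entirely."""
    return [1.0] * len(tokens)

def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; zero vectors are handled explicitly."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0 if norm_a == norm_b else 1.0
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)

def adversarial_sensitivity(
    explainer: Callable[[List[str]], List[float]],
    original: List[str],
    adversarial: List[str],
) -> float:
    """Explanation change between an input and its adversarial counterpart.

    Assumes a same-length substitution attack that flips the prediction.
    """
    if len(original) != len(adversarial):
        raise ValueError("sketch assumes a same-length substitution attack")
    if predict(original) == predict(adversarial):
        raise ValueError("the adversarial input must flip the prediction")
    return cosine_distance(explainer(original), explainer(adversarial))

# A character-level substitution ("great" -> "gr8") flips the toy prediction.
original = ["great", "cast", "bad", "ending"]
adversarial = ["gr8", "cast", "bad", "ending"]

print("occlusion:", adversarial_sensitivity(occlusion_explainer, original, adversarial))
print("constant: ", adversarial_sensitivity(constant_explainer, original, adversarial))
```

In this toy setting the occlusion explainer's attributions shift when the prediction flips, while the constant explainer's do not, so only the former passes this kind of necessary check. Passing it does not by itself establish faithfulness.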

Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations

Towards Faithful Model Explanation in NLP: A Survey

Does Faithfulness Conflict with Plausibility? An Empirical Study in Explainable AI across NLP Tasks

Faithfulness Tests for Natural Language Explanations

New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing

F-Fidelity: A Robust Framework for Faithfulness Evaluation of Explainable AI

The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models

Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models

On Measuring Faithfulness or Self-consistency of Natural Language Explanations

Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models

Logical satisfiability of counterfactuals for faithful explanations in NLI

Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness?

Evaluating the overall sensitivity of saliency-based explanation methods

With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness

Evaluating Human Alignment and Model Faithfulness of LLM Rationale

On the (In)fidelity and Sensitivity for Explanations

FaithLM: Towards Faithful Explanations for Large Language Models

Perks and Pitfalls of Faithfulness in Regular, Self-Explainable and Domain Invariant GNNs

Faithfulness Measurable Masked Language Models