Abstract: Transformer models have achieved unprecedented breakthroughs in text classification and have become the foundation of most state-of-the-art NLP systems. The core component driving this success is the attention mechanism, which lets the model dynamically focus on different parts of the input sequence when producing predictions. Several previous works have investigated using attention weights to explain model predictions because, intuitively, attention weights reflect the importance of input positions to the output. Specifically, the objective of explanation is to compute a relevance score for each input token so that the input words most important to the prediction can be identified. However, previous efforts produced mixed results. We find that the key reason attention weights cannot be used directly as effective relevance indicators is that they do not contain directional information about relevance (i.e., whether an input token contributes towards or against the prediction). We then propose two novel explanation techniques, AGrad and RePAGrad, that produce directional relevance scores based on attention weights. To evaluate explanation performance, we propose three properties that an effective explanation method should satisfy (faithfulness, resilience, and consistency) and design a corresponding test to quantify each property. Through extensive evaluations with Transformer models and pre-trained BERT models on multiple public text classification datasets, we show that AGrad and RePAGrad significantly outperform existing state-of-the-art explanation methods in faithfulness and consistency, at the cost of a nominal degradation in resilience compared to attention weights. In addition, we reveal that elements of a model architecture can play an important role in explainability.
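
The point about missing directional information can be illustrated with a toy example. The sketch below is not the paper's definition of AGrad or RePAGrad, which the abstract does not spell out. It assumes a hypothetical single-head attention classifier whose class logit is `w · Σ_i a_i x_i`, and it scales each attention weight by the gradient of the logit with respect to that weight. The function name, the toy model, and all values are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def signed_attention_relevance(X, q, w):
    """Toy signed relevance for a hypothetical one-layer attention classifier.

    X: (n, d) token embeddings, q: (d,) query vector,
    w: (d,) output weights for the class being explained.
    """
    a = softmax(X @ q)       # attention weights: non-negative, sum to 1
    grad_a = X @ w           # d(logit)/d(a_i), treating a as a free variable
    relevance = a * grad_a   # attention scaled by its gradient: signed score
    logit = float(a @ (X @ w))
    return a, relevance, logit

# Three toy tokens with two-dimensional embeddings (made-up values).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
q = np.array([1.0, 1.0])
w = np.array([1.0, -1.0])

a, relevance, logit = signed_attention_relevance(X, q, w)
print("attention:", a)
print("relevance:", relevance)
```

In this toy setup the first two tokens receive identical attention weights, yet one pushes the logit up and the other pushes it down. The attention weights alone cannot show that difference, while the signed scores can, and in this linear model they also sum to the logit.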

Prototypical Convolutional Neural Network for a Phrase-Based Explanation of Sentiment Classification

A Pixel-Level Explainable Approach of Convolutional Neural Networks and Its Application

SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers

Convolution-Based Neural Attention with Applications to Sentiment Classification

Faithful and Plausible Natural Language Explanations for Image Classification: A Pipeline Approach

On Exploring Attention-based Explanation for Transformer Models in Text Classification

ConvTextTM: An Explainable Convolutional Tsetlin Machine Framework for Text Classification

CoProNN: Concept-based Prototypical Nearest Neighbors for Explaining Vision Models

Deep Learning for Case-Based Reasoning Through Prototypes: A Neural Network That Explains Its Predictions

GAProtoNet: A Multi-head Graph Attention-based Prototypical Network for Interpretable Text Classification

Enhanced Prototypical Part Network (EPPNet) For Explainable Image Classification Via Prototypes

Unsupervised Explanation Generation Via Correct Instantiations

Keep the Faith: Faithful Explanations in Convolutional Neural Networks for Case-Based Reasoning

Language Model Meets Prototypes: Towards Interpretable Text Classification Models through Prototypical Networks

Explaining Deep Convolutional Neural Networks for Image Classification by Evolving Local Interpretable Model-agnostic Explanations

ProtoTEx: Explaining Model Decisions with Prototype Tensors

Towards Faithful Explanations for Text Classification with Robustness Improvement and Explanation Guided Training

T-TAME: Trainable Attention Mechanism for Explaining Convolutional Networks and Vision Transformers

CoSy: Evaluating Textual Explanations of Neurons

A Unified Concept-Based System for Local, Global, and Misclassification Explanations