Abstract: For AI systems to garner widespread public acceptance, we must develop methods capable of explaining the decisions of black-box models such as neural networks. In this work, we identify two issues with current explanatory methods. First, we show that two prevalent perspectives on explanations, feature-additivity and feature-selection, lead to fundamentally different instance-wise explanations. In the literature, explainers from different perspectives are currently compared directly, despite their distinct explanation goals. Second, current post-hoc explainers are validated either under simplistic scenarios (on simple models such as linear regression, or on models trained on synthetic datasets) or, when applied to real-world neural networks, under the assumption that the learned models behave reasonably. However, neural networks often rely on unreasonable correlations, even when producing correct decisions. We introduce a verification framework for explanatory methods under the feature-selection perspective. Our framework is based on a non-trivial neural network architecture, trained on a real-world task, for whose inner workings we are able to provide guarantees. We validate the efficacy of our evaluation by showing the failure modes of current explainers. We aim for this framework to provide a publicly available, off-the-shelf evaluation when the feature-selection perspective on explanations is needed.
