Abstract: Neural networks have been criticized for their lack of easy interpretation, which undermines confidence in their use for important applications. Here, we introduce a novel technique: interpreting a trained neural network by investigating its flip points. A flip point is any point that lies on the boundary between two output classes. For example, for a neural network with a binary yes/no output, a flip point is any input that generates equal scores for "yes" and "no". The flip point closest to a given input is of particular importance, and this point is the solution to a well-posed optimization problem. This paper gives an overview of the uses of flip points and how they are computed. Through results on standard datasets, we demonstrate how flip points can be used to provide detailed interpretation of the output produced by a neural network. Moreover, for a given input, flip points enable us to measure confidence in the correctness of outputs much more effectively than the softmax score. They also identify influential features of the inputs, identify bias, and find changes in the input that change the output of the model. We show that the distance between an input and its closest flip point identifies the most influential points in the training data. Analyzed with principal component analysis (PCA) and rank-revealing QR factorization (RR-QR), the set of directions from each training input to its closest flip point explains how a trained neural network processes an entire dataset: which features are most important for classification into a given class, which features are most responsible for particular misclassifications, how an adversary might fool the network, and so on. Although we investigate flip points for neural networks, their usefulness is model-agnostic.
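
To make the definition concrete, here is a minimal sketch of a closest flip point, assuming a two-class linear model rather than a neural network. With scores z_yes = w·x + b and z_no = 0, the two outputs are equal exactly when w·x + b = 0, so the closest flip point in Euclidean distance is the orthogonal projection of the input onto that hyperplane. For a trained neural network no such closed form is available and the point has to be found by numerical optimization. The function name `closest_flip_point` and the values of `w`, `b`, and `x` are hypothetical and chosen only for illustration.

```python
import math


def closest_flip_point(x, w, b):
    """Project x onto the hyperplane w.x + b = 0.

    Assumption: a linear two-class model with scores z_yes = w.x + b and
    z_no = 0, so "equal scores" means w.x + b = 0. This is a stand-in for
    the constrained optimization needed for a real neural network.
    """
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b  # z_yes - z_no at x
    w_norm_sq = sum(wi * wi for wi in w)
    if w_norm_sq == 0.0:
        raise ValueError("w must be non-zero, otherwise there is no boundary")
    step = margin / w_norm_sq
    return [xi - step * wi for xi, wi in zip(x, w)]


# Hypothetical model and input.
w = [3.0, 4.0]
b = -5.0
x = [3.0, 4.0]

x_flip = closest_flip_point(x, w, b)

# Distance to the boundary: small values suggest low confidence in the output.
distance = math.dist(x, x_flip)

# Direction from the input to its flip point: large components point to
# features whose change most readily flips the decision.
direction = [fi - xi for fi, xi in zip(x_flip, x)]

# Score difference at the flip point; it should be zero up to rounding.
score_gap = sum(wi * fi for wi, fi in zip(w, x_flip)) + b
```

In this toy setting, `distance` plays the role of the confidence measure described in the abstract, and `direction` is the kind of vector that would be collected over a training set before applying PCA or RR-QR.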

Closed-Form Interpretation of Neural Network Classifiers with Symbolic Gradients

Closed-Form Interpretation of Neural Network Latent Spaces with Symbolic Gradients

Understanding Neural Networks through Representation Erasure

Exploring Hidden Semantics in Neural Networks with Symbolic Regression

Opening the Black Box of Neural Networks: Methods for Interpreting Neural Network Models in Clinical Applications

Interpretable Neural PDE Solvers using Symbolic Frameworks

A Symbolic Approach to Explaining Bayesian Network Classifiers

Interpretable Function Embedding and Module in Convolutional Neural Networks

Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning

A Neuro-Symbolic Method for Solving Differential and Functional Equations

Seeing in Words: Learning to Classify through Language Bottlenecks

Explaining neural networks without access to training data

Symbolic regression via neural networks

A Test Statistic Estimation-based Approach for Establishing Self-interpretable CNN-based Binary Classifiers

A Peek Into the Reasoning of Neural Networks: Interpreting with Structural Visual Concepts

Representations as Language: An Information-Theoretic Framework for Interpretability

A Theory of Diagnostic Interpretation in Supervised Classification

Interpreting Neural Networks Using Flip Points

What is Interpretability?

Understanding polysemanticity in neural networks through coding theory