Abstract: The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.

Interpretable Artificial Intelligence through the Lens of Feature Interaction

Interpretable deep learning: interpretation, interpretability, trustworthiness, and beyond

A Survey of the Interpretability Aspect of Deep Learning Models

A Feature Structure Based Interpretability Evaluation Approach for Deep Learning

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

Interpretability of deep learning models: A survey of results

Interpreting Deep Learning Models in Natural Language Processing: A Review

Asymmetric feature interaction for interpreting model predictions

Feature-Based Interpretation of Image Classification With the Use of Convolutional Neural Networks

Interpretable Machine Learning -- A Brief History, State-of-the-Art and Challenges

Interpretable machine learning: Fundamental principles and 10 grand challenges

On Interpretability of Artificial Neural Networks: A Survey

What is Interpretable? Using Machine Learning to Design Interpretable Decision-Support Systems

Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models

On the Semantic Interpretability of Artificial Intelligence Models

Towards Understanding Sensitive and Decisive Patterns in Explainable AI: A Case Study of Model Interpretation in Geometric Deep Learning

Interpretability of Machine Learning: Recent Advances and Future Prospects

Multicriteria interpretability driven deep learning

Explainable AI: A Review of Machine Learning Interpretability Methods

Looking deeper into interpretable deep learning in neuroimaging: a comprehensive survey