Abstract: Artificial neural networks have long been understood as "black boxes": though we know their computation graphs and learned parameters, the knowledge encoded by these weights and the functions they perform are not inherently interpretable. As such, since the early days of deep learning, there have been efforts to explain these models' behavior and understand them internally. More recently, mechanistic interpretability (MI) has emerged as a distinct research area studying the features and implicit algorithms learned by foundation models such as large language models. In this work, we aim to ground MI in the context of cognitive science, which has long struggled with analogous questions in studying and explaining the behavior of "black box" intelligent systems like the human brain. We leverage several important ideas and developments in the history of cognitive science to disentangle divergent objectives in MI and indicate a clear path forward. First, we argue that current methods are ripe to facilitate a transition in deep learning interpretation echoing the "cognitive revolution" in 20th-century psychology, which shifted the study of human psychology from pure behaviorism toward mental representations and processing. Second, we propose a taxonomy mirroring key parallels in computational neuroscience to describe two broad categories of MI research: semantic interpretation (what latent representations are learned and used) and algorithmic interpretation (what operations are performed over representations), in order to elucidate their divergent goals and objects of study. Finally, we elaborate the parallels and distinctions between various approaches in both categories, analyze the respective strengths and weaknesses of representative works, clarify underlying assumptions, outline key challenges, and discuss the possibility of unifying these modes of interpretation under a common framework.

Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability

Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability

Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks

Training Neural Networks for Modularity aids Interpretability

From Neurons to Neutrons: A Case Study in Interpretability

Modularity Facilitates Classification Performance of Spiking Neural Networks for Decoding Cortical Spike Trains

Interpretable Function Embedding and Module in Convolutional Neural Networks

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience

Leveraging Brain Modularity Prior for Interpretable Representation Learning of fMRI

Brain Decodes Deep Nets

Modular representations emerge in neural networks trained to perform context-dependent tasks

BIMM: Brain Inspired Masked Modeling for Video Representation Learning

Breaking Neural Network Scaling Laws with Modularity

Modularity maximization as a flexible and generic framework for brain network exploratory analysis

Mechanistic Interpretability of Binary and Ternary Transformers

Modular neural network via exploring category hierarchy

Modularizing while Training: A New Paradigm for Modularizing DNN Models

The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms

NeuroView: Explainable Deep Network Decision Making

Self-Supervised Interpretable End-to-End Learning via Latent Functional Modularity