Abstract: Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question its usefulness to advance the broader goals of AI. However, it has been overlooked that these issues resemble those that have been grappled with in another field: Cognitive Neuroscience. Here we draw the relevant connections and highlight lessons that can be transferred productively between fields. Based on these, we propose a general conceptual framework and give concrete methodological strategies for building mechanistic explanations in AI inner interpretability research. With this conceptual framework, Inner Interpretability can fend off critiques and position itself on a productive path to explain AI systems.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address several key issues in the field of Inner Interpretability of AI:

1. **Lack of Conceptual Framework**:
   - Despite many interesting results in inner interpretability in recent years, the field still lacks a unified conceptual framework to guide the development, discussion, analysis, and improvement of its mechanistic explanations. This leaves the field open to criticism that questions its contribution to the overall goals of AI.
2. **Methodological Issues**:
   - Current methodological strategies are not fully understood, which may lead to misleading or contradictory conclusions. Shortcomings in research practice result in insufficiently comprehensive explanations of the behavior of complex systems.
3. **Insufficient Generalization Ability**:
   - Existing methods often generalize only weakly to real-world problems or models. This limits the effectiveness of inner interpretability in practical applications.
4. **Unclear Objectives**:
   - The field lacks clear definitions of its core questions and of what it means to understand a model mechanistically. As a result, the choice of research problems is driven by existing techniques and heuristics rather than by scientific needs.
5. **Similar Issues with Cognitive Neuroscience**:
   - The problems faced by inner interpretability closely resemble long-standing problems in cognitive neuroscience. However, the connections between the two fields, and the lessons that could be transferred, have not been fully exploited.

### Solutions

To address the above issues, the paper proposes a conceptual framework and draws on methodological strategies from cognitive neuroscience:

1. **Multi-Level Explanation Framework**:
   - Introduce a multi-level explanation framework (e.g., Marr & Poggio, 1976) to analyze the internal mechanisms of models at three levels: computational problems, algorithmic descriptions, and implementation details.
2. **Mutual Constraint Strategy**:
   - Use mutual constraints between the levels to guide and verify the construction of mechanistic explanations. For example, high-level functional descriptions can guide the search for low-level neural mechanisms.
3. **Choosing Appropriate Levels of Abstraction**:
   - Choose levels of abstraction that make explanations both understandable to humans and computationally feasible. The available levels range from microscopic neurons to macroscopic modules and representation trajectories.
4. **Combining Top-Down and Bottom-Up Approaches**:
   - Combine top-down methods (based on predefined representations and operations) with bottom-up methods (starting from the basic elements of the network) to reduce the impact of assumptions and improve the consistency of explanations.

Through these strategies, the paper aims to provide a more solid foundation for inner interpretability, enabling the field to better address criticisms and contribute to the overall progress of AI research.
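
As a toy illustration of the strategies listed above, the following sketch applies the three levels and the top-down/bottom-up combination to a hand-built network. Everything in it is an assumption made for illustration and is not taken from the paper: the 2-2-1 threshold network, its weights, the XOR task, and the helper names `top_down` and `bottom_up`. The top-down check tests whether any hidden unit matches a hypothesized intermediate variable; the bottom-up check ablates each unit and records which inputs change the output.

```python
from itertools import product

# Implementation level: hand-set weights of a hypothetical 2-2-1 threshold network.
W_HIDDEN = [(1.0, 1.0, -0.5), (1.0, 1.0, -1.5)]  # (w_x1, w_x2, bias) per hidden unit
W_OUT = (1.0, -1.0, -0.5)                        # (w_h0, w_h1, bias)

INPUTS = list(product([0, 1], repeat=2))


def step(z):
    """Threshold activation: 1 if z is positive, else 0."""
    return 1 if z > 0 else 0


def hidden(x1, x2):
    """Activations of the hidden units for one input."""
    return [step(w1 * x1 + w2 * x2 + b) for w1, w2, b in W_HIDDEN]


def forward(x1, x2, ablate=None):
    """Network output; optionally zero out one hidden unit (ablation)."""
    h = hidden(x1, x2)
    if ablate is not None:
        h[ablate] = 0
    w0, w1, b = W_OUT
    return step(w0 * h[0] + w1 * h[1] + b)


# Computational level: what problem is being solved (assumed here to be XOR).
def target(x1, x2):
    return x1 ^ x2


# Algorithmic level: hypothesized intermediate variables (XOR = OR and not AND).
HYPOTHESES = {"OR": lambda a, b: a | b, "AND": lambda a, b: a & b}


def top_down():
    """For each hypothesized variable, find hidden units that match it on every input."""
    return {
        name: [u for u in range(len(W_HIDDEN))
               if all(hidden(*x)[u] == f(*x) for x in INPUTS)]
        for name, f in HYPOTHESES.items()
    }


def bottom_up():
    """For each hidden unit, find the inputs whose output changes when it is ablated."""
    return {
        u: [x for x in INPUTS if forward(*x, ablate=u) != forward(*x)]
        for u in range(len(W_HIDDEN))
    }


if __name__ == "__main__":
    print("behavior matches spec:", all(forward(*x) == target(*x) for x in INPUTS))
    print("top-down:", top_down())
    print("bottom-up:", bottom_up())
```

In this toy case the two analyses constrain each other: the unit that the top-down check labels as OR is the one whose ablation breaks the inputs where exactly one bit is set, and the unit labeled AND is the one whose ablation breaks the input where both bits are set. Real networks rarely align this cleanly with a single hypothesis, which is why cross-checking between levels is useful.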

Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience

Real Sparks of Artificial Intelligence and the Importance of Inner Interpretability

The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

A functional contextual, observer-centric, quantum mechanical, and neuro-symbolic approach to solving the alignment problem of artificial general intelligence: safe AI through intersecting computational psychological neuroscience and LLM architecture for emergent theory of mind

Mechanistic Interpretability for AI Safety -- A Review

Designing explainable artificial intelligence with active inference: A framework for transparent introspection and decision-making

On Interdisciplinary Studies of a New Generation of Artificial Intelligence and Logic

Computational principles of intelligence: learning and reasoning with neural networks

Crafting explainable artificial intelligence through active inference: A model for transparent introspection and decision-making

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience

Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings

Active Inference as a Computational Framework for Consciousness

Algorithmic unconscious: why psychoanalysis helps in understanding AI

Making AI Intelligible: Philosophical Foundations

What is Interpretability?

Thinking Fast and Slow in AI

A Theoretical Framework for AI Models Explainability with Application in Biomedicine

Artificial cognition: How experimental psychology can help generate explainable artificial intelligence

Interfacing consciousness

Diagnosing AI Explanation Methods with Folk Concepts of Behavior