Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience

Martina G. Vilas, Federico Adolfi, David Poeppel, Gemma Roig
2024-07-31
Abstract: Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question its usefulness to advance the broader goals of AI. However, it has been overlooked that these issues resemble those that have been grappled with in another field: Cognitive Neuroscience. Here we draw the relevant connections and highlight lessons that can be transferred productively between fields. Based on these, we propose a general conceptual framework and give concrete methodological strategies for building mechanistic explanations in AI inner interpretability research. With this conceptual framework, Inner Interpretability can fend off critiques and position itself on a productive path to explain AI systems.
Artificial Intelligence, Machine Learning, Neurons and Cognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses several key issues in the field of inner interpretability of AI:

1. **Lack of a Conceptual Framework**:
   - Despite many interesting results in inner interpretability in recent years, the field still lacks a unified conceptual framework to guide the development, discussion, analysis, and refinement of mechanistic explanations. This leaves it vulnerable to critiques that question its contribution to the broader goals of AI.
2. **Methodological Issues**:
   - The current methodological strategies are not fully understood, which can lead to misleading or contradictory conclusions. Shortcomings in research practice produce explanations of complex system behavior that are neither sufficient nor comprehensive.
3. **Weak Generalization**:
   - Existing methods often achieve only weak generalization when applied to real-world problems or models, which limits the usefulness of inner interpretability in practice.
4. **Unclear Objectives**:
   - The field lacks clear definitions of its core questions and of what it means to understand a model mechanistically. As a result, research questions tend to be chosen on the basis of available techniques and heuristics rather than scientific needs.
5. **Parallels with Cognitive Neuroscience**:
   - The problems facing inner interpretability closely resemble long-standing ones in cognitive neuroscience, yet the connections between the two fields, and the lessons each can offer the other, have not been fully exploited.

### Solutions

To address these issues, the paper proposes a conceptual framework and draws on methodological strategies from cognitive neuroscience:

1. **Multi-Level Explanation Framework**:
   - Adopt a multi-level framework of explanation (e.g., Marr & Poggio, 1976) that analyzes a model's internal mechanisms at three levels: the computational problem, the algorithmic description, and the implementation.
2. **Mutual Constraints Between Levels**:
   - Use constraints between levels to guide and validate the construction of mechanistic explanations. For example, a high-level functional description can guide the search for the low-level mechanisms that implement it.
3. **Choosing Appropriate Levels of Abstraction**:
   - Select levels of abstraction that make explanations understandable to humans and computationally tractable, ranging from individual neurons up to larger-scale modules and representational trajectories.
4. **Combining Top-Down and Bottom-Up Approaches**:
   - Combine top-down research (starting from predefined representations and operations) with bottom-up research (starting from the network's basic elements) to reduce the influence of prior assumptions and improve the consistency of explanations.

Through these strategies, the paper aims to give inner interpretability a more solid foundation, enabling it to better answer its critics and contribute to the broader progress of AI research.
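To make the three levels, the top-down and bottom-up searches, and the idea of mutual constraints more concrete, here is a minimal sketch that is not taken from the paper. It assumes a hypothetical, hand-wired two-unit threshold network that computes XOR: XOR is the computational-level description, the hypothesis "XOR = AND(OR, NAND)" is an algorithmic-level description, and the weights are the implementation. The network, its weights, the OR/NAND hypothesis, and every function name are illustrative assumptions. The causal check used to link the levels is an interchange-style intervention (patching a unit with its value from another input), a technique common in interpretability work and swapped in here for illustration rather than prescribed by the paper.

```python
from itertools import product

# Hypothetical toy model: a hand-wired threshold network for XOR.
# The weights, the OR/NAND hypothesis and all names are illustrative
# assumptions, not taken from the paper.

INPUTS = list(product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)


def step(x: float) -> int:
    """Threshold nonlinearity: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


# Computational level: WHAT is computed (the input-output mapping).
def xor_spec(a: int, b: int) -> int:
    return a ^ b


# Algorithmic level: HOW it might be computed, via hypothesized variables.
HYPOTHESIS = {
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
}


def algorithm(a: int, b: int, override=None) -> int:
    """XOR = AND(OR, NAND); `override` fixes chosen variables to given values."""
    values = {name: f(a, b) for name, f in HYPOTHESIS.items()}
    values.update(override or {})
    return values["OR"] & values["NAND"]


# Implementation level: concrete units, weights and biases.
HIDDEN = [(-1.0, -1.0, 1.5), (1.0, 1.0, -0.5)]  # (w_a, w_b, bias) per unit
OUTPUT = (1.0, 1.0, -1.5)                        # (w_h0, w_h1, bias)


def hidden_acts(a: int, b: int) -> list:
    return [step(wa * a + wb * b + c) for wa, wb, c in HIDDEN]


def network(a: int, b: int, patch=None) -> int:
    """Forward pass; `patch` maps hidden-unit index -> forced activation."""
    h = hidden_acts(a, b)
    for unit, value in (patch or {}).items():
        h[unit] = value
    w0, w1, c = OUTPUT
    return step(w0 * h[0] + w1 * h[1] + c)


# Bottom-up: describe each unit by its response profile, with no prior labels.
def bottom_up_profiles() -> dict:
    return {u: tuple(hidden_acts(a, b)[u] for a, b in INPUTS)
            for u in range(len(HIDDEN))}


# Top-down: look for units whose profile matches a hypothesized variable.
def top_down_match() -> dict:
    profiles = bottom_up_profiles()
    return {name: [u for u, p in profiles.items()
                   if p == tuple(f(a, b) for a, b in INPUTS)]
            for name, f in HYPOTHESIS.items()}


# Mutual constraint: if `unit` implements `name`, swapping in the unit's value
# from a source input must change the network's output exactly as swapping the
# variable changes the algorithm's output (an interchange-style intervention).
def interchange_consistent(name: str, unit: int) -> bool:
    for base, src in product(INPUTS, repeat=2):
        net_out = network(*base, patch={unit: hidden_acts(*src)[unit]})
        alg_out = algorithm(*base, override={name: HYPOTHESIS[name](*src)})
        if net_out != alg_out:
            return False
    return True


if __name__ == "__main__":
    print(bottom_up_profiles())  # {0: (1, 1, 1, 0), 1: (0, 1, 1, 1)}
    print(top_down_match())      # {'OR': [1], 'NAND': [0]}
    print(interchange_consistent("OR", 1), interchange_consistent("OR", 0))  # True False
```

In this sketch, matching a unit's profile to a hypothesized variable (top-down) or simply recording it (bottom-up) is only correlational; the interchange check is what ties an algorithmic variable to a specific unit causally. Under these assumptions, that is one way to read the paper's call for mutual constraints between levels: the algorithmic hypothesis tells you what to look for, and the implementation either confirms or rules out the proposed mapping.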