Abstract: We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. As opposed to large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.

What problem does this paper attempt to address?

This paper addresses how to open the "black box" of artificial intelligence, that is, how to understand the internal workings of neural networks through mechanistic interpretability. Specifically, it proposes a new method named MIPS (Mechanistic-Interpretability-based Program Synthesis), which automatically distills the algorithms learned by trained neural networks into Python code.

#### Specific problems include:

1. **The black-box problem of neural networks**: Modern machine-learning models, especially deep neural networks, are usually black boxes whose internal workings are difficult to understand. This limits their transparency and trustworthiness in critical applications.
2. **Automated mechanistic interpretability**: The paper aims to make neural networks more interpretable and trustworthy through automated means. Traditional mechanistic interpretability research depends on human effort and is difficult to scale to larger models, so the authors set out to develop a fully automated method.
3. **Program synthesis**: Program synthesis is a classic problem that aims to generate programs automatically from input-output examples or natural-language descriptions. Existing large language models (such as GPT-4) perform well on some tasks, but they rely on human-written code (such as code on GitHub). MIPS attempts to extract algorithms directly from neural networks without using human training data.

#### Solutions:

MIPS achieves its goals through the following steps:

1. **Neural network training**: Train a recurrent neural network (RNN) to learn an algorithm that performs the required task.
2. **Neural network simplification**: Reduce the complexity of the network through automatic simplification techniques while maintaining its performance.
3. **Finite-state-machine extraction**: Convert the simplified network into a finite state machine, using an integer autoencoder.
4. **Symbolic regression**: Apply Boolean or integer symbolic regression to capture the learned algorithm and represent it as an exact symbolic expression.

Finally, MIPS generates Python code with the same input-output behavior as the original neural network, which makes the learned algorithm interpretable and transparent.

#### Results:

On a benchmark of 62 algorithmic tasks, MIPS solves 32, including 13 that GPT-4 does not solve (GPT-4 solves 30). This shows that MIPS is complementary to GPT-4 on certain tasks, and it achieves this without relying on human-written code.

In summary, the main contribution of this paper is a new, automated mechanistic interpretability method that distills the algorithms learned by neural networks into an interpretable form.
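The finite-state-machine and symbolic-regression steps can be illustrated with a small, self-contained sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: `toy_rnn_step` is a hand-written stand-in for a trained RNN cell that tracks the running parity of a bit string, plain rounding stands in for the integer autoencoder, and a brute-force search over a short hypothetical list of formulas (`CANDIDATES`) stands in for Boolean symbolic regression.

```python
import itertools
import math


def toy_rnn_step(h, x):
    """Hypothetical stand-in for a trained RNN cell (not from the paper).

    The hidden state stays close to, but never exactly at, 0.0 or 1.0.
    """
    pre = h + x - 2.0 * h * x  # equals XOR when h and x are exact bits
    return 1.0 / (1.0 + math.exp(-12.0 * (pre - 0.5)))  # steep sigmoid


def rnn_output(bits):
    """Run the toy cell over a bit sequence and discretize the final state."""
    h = 0.0
    for x in bits:
        h = toy_rnn_step(h, x)
    return round(h)


def extract_fsm(step, sequences, h0=0.0):
    """Round hidden states to integers and record (state, input) -> next state."""
    table = {}
    for seq in sequences:
        h = h0
        for x in seq:
            h_next = step(h, x)
            key, nxt = (round(h), x), round(h_next)
            if table.setdefault(key, nxt) != nxt:
                raise ValueError("rounded states do not form a state machine")
            h = h_next
    return table


# Hypothetical candidate formulas, ordered from simplest to most complex.
CANDIDATES = [
    ("s", lambda s, x: s),
    ("x", lambda s, x: x),
    ("s & x", lambda s, x: s & x),
    ("s | x", lambda s, x: s | x),
    ("s ^ x", lambda s, x: s ^ x),
]


def fit_boolean_formula(table):
    """Return the first candidate that reproduces every table entry, or None."""
    for expr, fn in CANDIDATES:
        if all(fn(s, x) == nxt for (s, x), nxt in table.items()):
            return expr
    return None


def synthesize(expr):
    """Emit Python source for a loop that applies the fitted update rule."""
    return (
        "def run(bits):\n"
        "    s = 0\n"
        "    for x in bits:\n"
        f"        s = {expr}\n"
        "    return s\n"
    )


sequences = list(itertools.product([0, 1], repeat=4))
table = extract_fsm(toy_rnn_step, sequences)
formula = fit_boolean_formula(table)
source = synthesize(formula)

namespace = {}
exec(source, namespace)  # compile the synthesized program
run = namespace["run"]

# Check that the synthesized program matches the toy network on every input.
agrees = all(run(seq) == rnn_output(seq) for seq in sequences)
print(source)
print("matches toy RNN:", agrees)
```

In this toy setting the state space is a single bit, so rounding is enough to recover a consistent transition table. A real trained network would generally need the simplification and integer-autoencoder steps described above before any such table can be read off.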

Opening the AI black box: program synthesis via mechanistic interpretability
