Opening the AI black box: program synthesis via mechanistic interpretability

Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark
2024-02-08
Abstract: We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. As opposed to large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.
Machine Learning
### What problem does this paper attempt to solve?

This paper aims to open the "black box" of artificial intelligence, that is, to understand the internal workings of neural networks through mechanistic interpretability. Specifically, it proposes a new method named MIPS (Mechanistic-Interpretability-based Program Synthesis), which automatically distills the algorithm learned by a trained neural network into Python code.

#### Specific problems include:

1. **The black-box problem of neural networks**:
   - Modern machine-learning models, especially deep neural networks, are usually black boxes whose internal workings are difficult to understand. This limits their transparency and trustworthiness in critical applications.
2. **Automated mechanistic interpretability**:
   - The paper's goal is to make neural networks more interpretable and trustworthy through automated means. Traditional mechanistic interpretability research depends on human effort and is difficult to scale to larger models, so the authors aim to develop a fully automated method.
3. **Program synthesis**:
   - Program synthesis is a classic problem: automatically generating programs from input-output examples or natural-language descriptions. Existing large language models (such as GPT-4) perform well on some tasks, but they rely on human-written training data (such as code on GitHub). MIPS instead extracts algorithms directly from neural networks, without using human training data.

#### Solutions:

MIPS achieves its goals through the following steps:

1. **Neural network training**: Train a recurrent neural network (RNN) to learn an algorithm that performs the desired task.
2. **Neural network simplification**: Automatically simplify the trained network, reducing its complexity while preserving its performance.
3. **Finite-state-machine extraction**: Convert the simplified network into a finite state machine, using an integer autoencoder.
4. **Symbolic regression**: Apply Boolean or integer symbolic regression to capture the learned algorithm as an exact symbolic expression.

Finally, MIPS generates Python code with the same input-output behavior as the original neural network, which makes the learned algorithm interpretable and transparent.

#### Results:

On a benchmark of 62 algorithmic tasks, MIPS solves 32, including 13 that GPT-4 does not solve. This suggests that MIPS is complementary to GPT-4 on these tasks, and it achieves this without relying on human-written code.

In summary, the paper's main contribution is a new, automated mechanistic interpretability method that lets neural networks not only learn algorithms but also have those algorithms distilled into an interpretable form.
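
To make the finite-state-machine extraction and symbolic regression steps more concrete, here is a minimal Python sketch on a toy parity task. It is not the MIPS implementation: `rnn_step` is a hand-written, hypothetical stand-in for a trained RNN cell, plain rounding stands in for the paper's integer autoencoder, and `CANDIDATES` is a small hand-picked hypothesis space rather than a real symbolic regression search.

```python
from itertools import product


def rnn_step(h: float, x: int) -> float:
    """Hypothetical stand-in for a trained RNN cell (not from the paper).

    It computes a soft XOR of the hidden state and the input bit, then
    sharpens the result toward 0 or 1, so hidden states cluster near integers.
    """
    s = h + x - 2 * h * x          # soft XOR of h and x
    return s * s * (3 - 2 * s)     # smoothstep pulls s toward 0 or 1


def extract_fsm(step, h0: float, max_len: int = 6) -> dict:
    """Run the cell on all bit strings up to max_len and tabulate the
    transitions between rounded hidden states.

    Rounding is an assumed, simplified substitute for an integer autoencoder.
    """
    table = {}
    for n in range(1, max_len + 1):
        for bits in product((0, 1), repeat=n):
            h = h0
            for x in bits:
                h_next = step(h, x)
                key, value = (round(h), x), round(h_next)
                # The discretization is only usable if transitions are deterministic.
                assert table.setdefault(key, value) == value
                h = h_next
    return table


# Small, hand-picked hypothesis space of Boolean formulas (an assumption).
CANDIDATES = {
    "s": lambda s, x: s,
    "x": lambda s, x: x,
    "s & x": lambda s, x: s & x,
    "s | x": lambda s, x: s | x,
    "s ^ x": lambda s, x: s ^ x,
}


def boolean_regression(table: dict):
    """Return the first candidate formula that reproduces every transition."""
    for name, f in CANDIDATES.items():
        if all(f(s, x) == nxt for (s, x), nxt in table.items()):
            return name
    return None


fsm = extract_fsm(rnn_step, h0=0.1)
formula = boolean_regression(fsm)
print(fsm)
print("next_state =", formula)
```

In this toy setting the continuous cell collapses to a two-state machine, and the search recovers an exact symbolic update rule that could be emitted as Python source. A real pipeline would need a far larger hypothesis space and a more careful discretization of the hidden states.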