Transcoders Find Interpretable LLM Feature Circuits

Jacob Dunefsky,Philippe Chlenski,Neel Nanda
2024-06-18
Abstract:A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. We then introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the greater-than circuit in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at <a class="link-external link-https" href="https://github.com/jacobdunefsky/transcoder_circuits" rel="external noopener nofollow">this https URL</a>.
Machine Learning,Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve This paper aims to address the challenges encountered in fine-grained circuit analysis within transformer-based language models. Specifically, it focuses on how to handle the issues related to the multi-layer perceptron (MLP) sublayers. The presence of MLP sublayers makes it difficult to perform fine-grained analysis of specific subgraphs (i.e., circuits) of the model's behavior or capabilities. The main issues include: 1. **Polysemanticity**: Neurons often activate on many unrelated concepts, making direct analysis of MLP neurons complex. 2. **Dense Activation**: MLP neurons typically activate in a dense manner, leading to circuit analysis becoming either infeasibly large or unable to distinguish between local and global behaviors. To address these issues, the authors introduce a method called **Transcoders**. Transcoders approximate the original densely activated MLP layer with a wider but sparsely activated MLP layer. This approach not only improves the interpretability of the model but also maintains fidelity to the original computation. In this way, Transcoders can replace complex model components with more understandable approximations, allowing researchers to better understand and interpret the model's behavior. ### Main Contributions 1. **Validation of Transcoders' Effectiveness and Interpretability**: The authors demonstrate that Transcoders can faithfully approximate MLP sublayers and perform comparably or even better than sparse autoencoders (SAEs) in terms of sparsity, fidelity, and human interpretability. 2. **Introduction of a New Circuit Analysis Method**: The authors present a new method for circuit analysis based on Transcoder weights, which can decompose circuits into input-invariant and input-dependent components, thereby providing a clearer understanding of the model's internal mechanisms. ### Application Examples The authors applied the Transcoder method to circuit analysis for various tasks, including: - **Blind Case Study**: Reverse engineering unknown features through circuit analysis without looking at specific example texts, thereby inferring the semantics of the features. - **"Greater Than" Circuit Analysis in GPT2-small**: An in-depth analysis of the "greater than" circuit in the GPT2-small model, validating the effectiveness of the Transcoder method in understanding and interpreting model behavior. ### Conclusion By introducing Transcoders, the authors provide an effective method to address the challenges of fine-grained circuit analysis in transformer-based language models. This approach not only enhances the interpretability of the models but also offers researchers new tools to understand and optimize these complex models.