Abstract: A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use this understanding to craft adversarial examples, uncover overfitting, and identify small language model circuits directly from the weights alone. Our results demonstrate that bilinear layers serve as an interpretable drop-in replacement for current activation functions and that weight-based interpretability is viable for understanding deep-learning models.
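The claim that a bilinear layer can be written with a third-order tensor can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the layer sizes, the random weights, and the helper name `bilinear` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3                      # hypothetical sizes, chosen for illustration
W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)


def bilinear(W, V, x):
    """Bilinear layer g(x) = (Wx) * (Vx), with no element-wise nonlinearity."""
    return (W @ x) * (V @ x)


# Third-order tensor with entries B[a, i, j] = W[a, i] * V[a, j],
# i.e. each slice B[a] is the outer product of row a of W and row a of V.
B = np.einsum("ai,aj->aij", W, V)

# Contracting B with x on both input axes reproduces the layer output.
g_tensor = np.einsum("aij,i,j->a", B, x, x)
tensor_matches = np.allclose(bilinear(W, V, x), g_tensor)
print(tensor_matches)
```

Because the tensor is built from the weights alone, it can be inspected without running the model on any dataset.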

What problem does this paper attempt to address?

This paper addresses the problem of understanding how multi-layer perceptrons (MLPs) compute in deep neural networks. Current interpretability work can extract features from hidden activations but usually cannot explain how MLP weights construct those features. Element-wise nonlinearities introduce higher-order interactions, which makes it difficult to trace computation through an MLP layer.

To address this, the authors analyze the bilinear MLP, a type of gated linear unit (GLU) that has no element-wise nonlinearity but still achieves competitive performance. A bilinear MLP can be expressed entirely in terms of linear operations on a third-order tensor, which allows flexible analysis of the weights. By decomposing the weights of bilinear MLPs, the authors reveal interpretable low-rank structure in toy tasks, image classification, and language modeling. They also show how to use this understanding to construct adversarial examples, detect overfitting, and identify small language model circuits directly from the weights.

### Main contributions of the paper

1. **Method introduction**: In section 3, the authors introduce several methods for analyzing bilinear MLPs. One of them decomposes the weights into a set of eigenvectors that equivalently explain the output along a given direction.
2. **Experimental verification**: In section 4, the authors apply the eigenvector decomposition to several image classification tasks and find interpretable low-rank structure. Terms with small eigenvalues can be truncated without affecting performance. The eigenvectors show how regularization reduces signs of overfitting in the extracted features, and they can be used to construct adversarial examples.
3. **Language model analysis**: In section 5, the authors analyze how the bilinear MLP computes output features from input features, where both are obtained through sparse dictionary learning (SDL). They highlight a small circuit that flips the sentiment polarity of the next token when the current token is a negation (such as "not"). In addition, many output features are well correlated with low-rank approximations, which suggests that weight-based interpretability is feasible in large language models.

### Formula representation

- Definition of the bilinear layer: \[ g(x) = (Wx) \odot (Vx) \] where \(W\) and \(V\) are weight matrices, \(x\) is the input vector, and \(\odot\) denotes element-wise multiplication.
- Interaction matrix for output \(a\) (a slice of the third-order tensor \(B\)): \[ B_{a::} = w_{a} v_{a}^{T} \] where \(w_{a}\) and \(v_{a}\) are the \(a\)-th rows of the weight matrices \(W\) and \(V\), respectively, so that \(g(x)_{a} = x^{T} B_{a::} x\).
- Decomposition into symmetric and antisymmetric parts: \[ B_{a::} = \frac{1}{2}(B_{a::} + B_{a::}^{T}) + \frac{1}{2}(B_{a::} - B_{a::}^{T}) \] The antisymmetric part contributes zero when both inputs are the same vector \(x\), so only the symmetric part \(B_{sym}\) needs to be considered.
- Eigenvalue decomposition: \[ Q = \sum_{i = 1}^{d} \lambda_{i} v_{i} v_{i}^{T} \] where \(Q\) is the symmetric interaction matrix for a given output direction, \(\lambda_{i}\) is an eigenvalue, and \(v_{i}\) is the corresponding eigenvector (not to be confused with the rows \(v_{a}\) of \(V\)).

With these methods, the authors show that the bilinear MLP can serve as an interpretable substitute for ordinary MLPs in multiple settings, providing a new way to understand deep-learning models.
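The formulas above can be traced end to end in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the sizes, the random weights, the output direction `u`, and the helper `truncated_output` are all hypothetical. It builds the interaction matrix for one output direction, confirms that the antisymmetric part drops out, and rewrites the output as a sum over eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 6, 5                      # hypothetical sizes, chosen for illustration
W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)
u = rng.normal(size=d_out)              # hypothetical output direction

# Interaction matrix for direction u: Q = sum_a u_a * w_a v_a^T
Q = np.einsum("a,ai,aj->ij", u, W, V)
Q_sym = 0.5 * (Q + Q.T)                 # symmetric part
Q_anti = 0.5 * (Q - Q.T)                # antisymmetric part

out = u @ ((W @ x) * (V @ x))           # layer output projected onto u
sym_term = x @ Q_sym @ x                # carries the whole output
anti_term = x @ Q_anti @ x              # vanishes up to floating-point error

# Q_sym is symmetric, so eigh gives real eigenvalues and orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(Q_sym)

# x^T Q_sym x = sum_i lambda_i * (v_i^T x)^2
spectral = np.sum(eigvals * (eigvecs.T @ x) ** 2)


def truncated_output(x, k):
    """Approximate the output using only the k largest-magnitude eigenvalues."""
    keep = np.argsort(-np.abs(eigvals))[:k]
    return np.sum(eigvals[keep] * (eigvecs[:, keep].T @ x) ** 2)


print(np.isclose(out, sym_term), np.isclose(out, spectral))
```

With random weights there is no reason for the spectrum to be low-rank, so `truncated_output(x, k)` for small `k` is only a rough approximation here; the summary's claim that small-eigenvalue terms can be dropped concerns trained models.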

Bilinear MLPs enable weight-based mechanistic interpretability
