Tackling Polysemanticity with Neuron Embeddings

Alex Foote
2024-11-13
Abstract: We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).
Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of neuron polysemanticity. Specifically, the author introduces a method called "neuron embeddings", which identifies the distinct semantic behaviors in a neuron's characteristic dataset examples. The core content of the paper is summarized below.

### 1. Research Background

Mechanistic interpretability (MI) aims to decompose neural networks into their components and understand how these components interact to produce the behavior of the network. A major obstacle is that neurons usually respond to multiple, completely different concepts, a phenomenon known as polysemanticity. This makes it difficult to find a clear and simple explanation for a neuron's behavior and weakens the view that neurons are a natural basis for decomposing a model. Polysemanticity is particularly prevalent in language models, which makes the MLP layers very difficult to interpret.

### 2. Solution: Neuron Embeddings

To tackle polysemanticity, the author proposes "neuron embeddings", a representation that captures the information in a given input that drives the neuron's response. For a given input \(x\) and an activated neuron \(N_{i,j}\), the neuron embedding \(e_{i,j}\) is defined as:

\[e_{i,j} = h_{i-1} \odot w_{i,j}\]

where \(h_{i-1}\) is the internal vector representation (pre-MLP embedding) of the input before it enters the \(i\)-th layer, \(w_{i,j}\) is the input weight vector of the neuron, and \(\odot\) denotes the element-wise product.

### 3. Applications and Effects

- **Cluster Analysis**: Computing neuron embeddings and clustering them separates a neuron's dataset examples into its different semantic behaviors, which makes manual or automatic interpretation easier.
- **Measuring Polysemanticity**: Simple geometric metrics over a neuron's embeddings, such as the maximum or the average distance between points, can be used to measure how polysemantic the neuron is.
- **Sparse Autoencoder (SAE) Evaluation**: Neuron embeddings can also be used to better evaluate sparse autoencoders, in particular by introducing new loss terms that improve monosemanticity.

### 4. Experimental Results

- **Feature Clustering**: Experiments on GPT2-small show that neuron embeddings capture the semantic similarity between a neuron's dataset examples and group different expressions of the same concept together.
- **Sparse Autoencoder Training**: Experiments on the MNIST dataset show that adding the neuron embedding loss term increases the reconstruction error and decreases the activation sparsity, but significantly improves the monosemanticity and interpretability of neurons and reduces the proportion of inactive (dead) neurons.

### 5. Conclusion

The paper shows that neuron embeddings are effective in tackling neuron polysemanticity. They help to understand neuron behavior more clearly and may also improve the results of automated interpretation techniques based on dataset examples. Future research can further explore the specific mechanism of the neuron embedding loss term and its impact on model performance. In conclusion, this paper provides a novel and effective tool, neuron embeddings, for dealing with the challenge of neuron polysemanticity in neural networks.
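
As a rough illustration of the definition in section 2 and the distance metrics in section 3, the sketch below computes neuron embeddings for one neuron over a few inputs and then takes the maximum and mean pairwise distance. It is not the paper's implementation: the function names, the toy vectors, and the choice of Euclidean distance are all assumptions made for this example, and a real use would read \(h_{i-1}\) and \(w_{i,j}\) from the model.

```python
import math
from itertools import combinations


def neuron_embedding(h_prev, w):
    """Element-wise product of the pre-MLP representation and the neuron's input weights.

    h_prev and w are plain lists of floats of equal length (hypothetical stand-ins
    for h_{i-1} and w_{i,j}).
    """
    if len(h_prev) != len(w):
        raise ValueError("h_prev and w must have the same length")
    return [h * wk for h, wk in zip(h_prev, w)]


def pairwise_distance_stats(embeddings):
    """Return (max, mean) Euclidean distance over all pairs of embeddings.

    Euclidean distance is an assumption here; the summary only mentions
    "distance between points".
    """
    dists = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    if not dists:
        return 0.0, 0.0
    return max(dists), sum(dists) / len(dists)


# Toy values invented for illustration, not taken from any model.
w = [1.0, -2.0, 0.5]          # input weights of one neuron
examples = [                  # pre-MLP representations of three inputs
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 4.0],
    [0.0, -1.0, 0.0],
]

embeddings = [neuron_embedding(h, w) for h in examples]
max_dist, mean_dist = pairwise_distance_stats(embeddings)

# The components of each embedding sum to the dot product of h_prev and w,
# i.e. the neuron's pre-activation before any bias term.
pre_activations = [sum(e) for e in embeddings]
```

In this toy case the first two inputs have nearby embeddings while the third sits far from both, which is the kind of separation a clustering step would pick up as two distinct behaviors. A larger maximum or mean distance would, under these assumptions, suggest a more polysemantic neuron.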