Abstract: We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).

What problem does this paper attempt to address?

This paper addresses the problem of neuron polysemanticity. Specifically, the author introduces a method called "neuron embeddings", which identifies the distinct semantic behaviors in a neuron's characteristic dataset examples. The following is a summary of the core content of the paper:

### 1. Research Background

Mechanistic interpretability (MI) aims to decompose neural networks into their components and to understand how those components interact to produce the network's behavior. A major obstacle is that neurons usually respond to multiple, completely different concepts, a phenomenon known as polysemanticity. This makes it difficult to find a clear and simple explanation of a neuron's behavior, and it weakens the view that neurons are a natural basis for decomposing a model. Polysemanticity is particularly prevalent in language models, which makes their MLP layers very difficult to interpret.

### 2. Solution: Neuron Embeddings

To address polysemanticity, the author proposes "neuron embeddings", a representation that captures the information in a given input that causes a neuron to respond. For a given input \(x\) and an activated neuron \(N_{i,j}\), the neuron embedding \(e_{i,j}\) is defined as:

\[e_{i,j}=h_{i-1}\odot w_{i,j}\]

where \(h_{i-1}\) is the internal vector representation of the input before it enters the \(i\)-th layer (the pre-MLP embedding), \(w_{i,j}\) is the neuron's input weight vector, and \(\odot\) denotes the element-wise product.

### 3. Applications and Effects

- **Cluster Analysis**: Computing neuron embeddings and clustering them separates a neuron's dataset examples into its different semantic behaviors, making manual or automatic interpretation easier.
- **Measuring Polysemanticity**: Simple geometric metrics over a neuron's embeddings, such as the maximum or average distance between points, can be used to measure how polysemantic the neuron is.
- **Sparse Autoencoder (SAE) Evaluation**: Neuron embeddings can also be used to better evaluate sparse autoencoders, and to define new loss terms that improve monosemanticity.

### 4. Experimental Results

- **Feature Clustering**: Experiments on GPT2-small show that neuron embeddings effectively capture the semantic similarity between a neuron's dataset examples and successfully group different expressions of the same concept.
- **Sparse Autoencoder Training**: Experiments on the MNIST dataset show that adding the neuron embedding loss term increases reconstruction error and decreases activation sparsity, but significantly improves the monosemanticity and interpretability of neurons while also reducing the proportion of inactive (dead) neurons.

### 5. Conclusion

The paper shows that neuron embeddings are effective in tackling neuron polysemanticity. They help to understand neuron behavior more clearly and may also improve the results of automated interpretation techniques based on dataset examples. Future research can further explore the specific mechanism of the neuron embedding loss term and its impact on model performance.
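The definition \(e_{i,j}=h_{i-1}\odot w_{i,j}\) and the distance-based polysemanticity metrics summarized above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy weights and hidden states, and the choice of Euclidean distance are all assumptions made for illustration, and plain Python lists stand in for the tensors a real model would produce.

```python
import math
from itertools import combinations


def neuron_embedding(h_prev, w):
    """Element-wise product of the pre-MLP representation and the neuron's input weights."""
    if len(h_prev) != len(w):
        raise ValueError("h_prev and w must have the same length")
    return [h * wi for h, wi in zip(h_prev, w)]


def pairwise_distance_stats(embeddings):
    """Return (max, mean) Euclidean distance over all pairs of embeddings."""
    dists = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    if not dists:
        return 0.0, 0.0
    return max(dists), sum(dists) / len(dists)


# Hypothetical toy data: one neuron's input weights and the pre-MLP
# representations of three inputs that activate it.
w = [1.0, -2.0, 0.5, 0.0]
hidden_states = [
    [1.0, 0.0, 2.0, 3.0],
    [1.0, 0.0, 2.0, -1.0],   # differs from the first only where w is zero
    [0.0, -1.0, 0.0, 5.0],   # reaches the same pre-activation via another dimension
]

embeddings = [neuron_embedding(h, w) for h in hidden_states]
max_dist, mean_dist = pairwise_distance_stats(embeddings)

# Summing an embedding recovers the neuron's pre-activation (ignoring bias),
# since sum(h * w) is the dot product of h and w.
pre_activations = [sum(e) for e in embeddings]
```

In this toy case all three inputs give the same pre-activation, yet the third embedding sits far from the first two, which is the kind of separation a clustering step or a max/mean distance metric would pick up. The first two inputs map to the same embedding because they differ only in a dimension the neuron does not read.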

Tackling Polysemanticity with Neuron Embeddings
