Abstract: Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require an additional property, "independent additivity": features should be able to be understood separately. We demonstrate an example of applying our MDL-inspired framework by training SAEs on MNIST handwritten digits and find that SAE features representing significant line segments are optimal, as opposed to SAEs with features for memorised digits from the dataset or small digit fragments. We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity such as undesirable feature splitting and that this framework naturally suggests new hierarchical SAE architectures which provide more concise explanations.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses several issues with using Sparse Autoencoders (SAEs) to explain the internal representations of neural networks:

1. **Over-sparsity and Excessive Width**: SAEs are conventionally trained by optimizing reconstruction loss and sparsity. Optimizing these two objectives alone favors SAEs that are extremely wide and sparse, which does not help interpretability.
2. **Independent Additivity of Explanations**: To be interpretable, SAEs need "independent additivity," meaning each feature can be understood on its own, without considering the activations of other features. Traditional SAEs often fail to meet this requirement due to causal entanglement between features.
3. **Relationship Between Description Length and Sparsity**: The paper proposes replacing simple sparsity optimization with the Minimal Description Length (MDL) principle. MDL rewards explanations that are both accurate and concise, which avoids the undesirable feature splitting that naively maximizing sparsity can cause.
4. **Feature Splitting Problem**: In SAEs trained on large language models, a larger dictionary leads to finer-grained features, a phenomenon known as "feature splitting." Some feature splitting is beneficial, but excessive splitting wastes dictionary capacity without improving interpretability. The paper explores how the MDL principle can reduce unnecessary feature splitting.

### Solutions

The paper proposes an information-theoretic framework that views SAEs as lossy compression algorithms for communicating explanations of neural activations. Specific solutions include:

1. **MDL Principle**: Select the optimal SAE by minimizing description length. The MDL principle accounts for both reconstruction error and the conciseness of the explanation, avoiding the problems that arise from pursuing sparsity alone.
2. **Independent Additivity**: Ensure that SAE features are independently additive so that each feature can be understood separately. The paper discusses SAE architectures compatible with independent additivity, such as linear decoders and directed tree decoders.
3. **Experimental Validation**: Experiments on the MNIST dataset show that the MDL principle selects intuitive, interpretable features: meaningful line segments (strokes), rather than small digit fragments or memorised digits from the dataset.
4. **Decision Boundary for Feature Splitting**: The paper proposes a decision boundary for determining when feature splitting is beneficial and when it is harmful. Minimizing description length restricts unnecessary feature splitting.

### Conclusion

By introducing the MDL principle, the paper offers a new perspective on optimizing SAEs for interpretability. The approach favors explanations that are accurate, concise, and independently additive, which reduces unnecessary feature splitting and improves the overall interpretability of the model.
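The description-length idea summarized above can be illustrated with a small sketch. This is not the paper's implementation: it assumes each feature's activation is quantised into bins of a hypothetical width `bin_width` and coded independently, so the per-sample description length is the sum of per-feature empirical entropies. The function names and the toy activation values are invented for illustration.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical Shannon entropy, in bits per symbol, of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def description_length_bits(codes, bin_width=0.25):
    """Approximate bits per sample needed to transmit a set of SAE codes.

    codes: list of rows, one per sample; each row lists the feature activations.
    Assumption (not from the source): features are quantised to `bin_width`
    and coded independently, so their entropies simply add.
    """
    n_features = len(codes[0])
    total = 0.0
    for j in range(n_features):
        # Quantise feature j across all samples, then measure its entropy.
        quantised = [round(row[j] / bin_width) for row in codes]
        total += entropy_bits(quantised)
    return total


# Toy data: one feature that fires on half of the samples ...
merged = [[1.0], [0.0], [1.0], [0.0]]
# ... versus the same activity split across two narrower features.
split = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

dl_merged = description_length_bits(merged)
dl_split = description_length_bits(split)
print(f"merged: {dl_merged:.3f} bits, split: {dl_split:.3f} bits")
```

With these toy numbers the split code costs more bits per sample than the merged one. Under an MDL criterion of this kind, a split would only pay for itself if it bought enough extra reconstruction accuracy, which is the intuition behind a decision boundary for feature splitting.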

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Which Neural Network Makes More Explainable Decisions? An Approach Towards Measuring Explainability

Automatically Interpreting Millions of Features in Large Language Models

Decomposing The Dark Matter of Sparse Autoencoders

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

Interpreting Attention Layer Outputs with Sparse Autoencoders

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

Efficient Dictionary Learning with Switch Sparse Autoencoders

Analyzing (In)Abilities of SAEs via Formal Languages

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Improving Dictionary Learning with Gated Sparse Autoencoders

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Disentangling Dense Embeddings with Sparse Autoencoders

Residual Stream Analysis with Multi-Layer SAEs

Scalable Partial Explainability in Neural Networks via Flexible Activation Functions

How to Squeeze An Explanation Out of Your Model

Interpret the Internal States of Recommendation Model with Sparse Autoencoder

Improve Interpretability of Neural Networks Via Sparse Contrastive Coding

An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation