Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde, Michael T. Pearce, Lee Sharkey
2024-10-15
Abstract: Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require an additional property, "independent additivity": features should be able to be understood separately. We demonstrate an example of applying our MDL-inspired framework by training SAEs on MNIST handwritten digits and find that SAE features representing significant line segments are optimal, as opposed to SAEs with features for memorised digits from the dataset or small digit fragments. We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity such as undesirable feature splitting and that this framework naturally suggests new hierarchical SAE architectures which provide more concise explanations.
Machine Learning, Artificial Intelligence, Information Theory
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses several issues with using Sparse Autoencoders (SAEs) to explain the internal representations of neural networks:

1. **Excessive Width and Sparsity**: SAEs are usually trained by optimizing reconstruction loss and sparsity. Naively optimizing these two objectives favors SAEs that are extremely wide and sparse, which does not help interpretability.
2. **Independent Additivity of Explanations**: To be interpretable, SAEs need "independent additivity," meaning each feature can be understood on its own without considering the activation of other features. Traditional SAEs often fail to meet this requirement due to causal entanglement between features.
3. **Relationship Between Description Length and Sparsity**: The paper proposes using the Minimal Description Length (MDL) principle in place of simple sparsity optimization. MDL accounts for both the accuracy of an explanation and its conciseness, which may avoid the undesirable feature splitting that naively maximizing sparsity can cause.
4. **Feature Splitting Problem**: In large language models, a larger dictionary leads to finer-grained features, a phenomenon known as "feature splitting." While some feature splitting is beneficial, excessive feature splitting wastes dictionary capacity and does not enhance interpretability. The paper explores how the MDL principle can reduce unnecessary feature splitting.

### Solutions

The paper proposes an information-theoretic framework that views SAEs as lossy compression algorithms for communicating explanations of neural activations. Specific solutions include:

1. **MDL Principle**: Selecting the SAE that minimizes description length. The MDL principle considers both reconstruction error and the conciseness of explanations, thus avoiding the issues that arise from pursuing sparsity alone.
2. **Independent Additivity**: Ensuring SAE features are independently additive so that each feature can be understood on its own. The paper discusses SAE architectures compatible with independent additivity, such as linear decoders and directed tree decoders.
3. **Experimental Validation**: Demonstrating through experiments on the MNIST dataset how the MDL principle can find more intuitive and interpretable features. The results show that the MDL-optimal SAE has features for meaningful strokes (significant line segments) rather than features for small digit fragments or for memorized digits from the dataset.
4. **Decision Boundary for Feature Splitting**: Proposing a decision boundary to determine when feature splitting is beneficial and when it is harmful. Minimizing description length restricts unnecessary feature splitting.

### Conclusion

By introducing the MDL principle, the paper provides a new perspective on optimizing the interpretability of SAEs. The approach favors explanations that are accurate, concise, and independently additive, thereby reducing unnecessary feature splitting and improving the overall interpretability of the model.
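
To make the description-length idea in this summary more concrete, here is a minimal sketch of one way to estimate how many bits an SAE's feature activations take to communicate. It assumes that each feature is coded separately (in the spirit of independent additivity), so the total is the sum of per-feature Shannon entropies after uniform quantization. The function names, the `bin_width` parameter, and the toy activations are hypothetical illustrations, not taken from the paper, and the paper's exact procedure may differ.

```python
import math
from collections import Counter


def feature_entropy_bits(values, bin_width):
    """Shannon entropy (in bits) of one feature's activations after uniform quantization."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # Map each activation to the index of the bin it falls into.
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def description_length_bits(activations, bin_width):
    """Sum of per-feature entropies: average bits per input if features are coded separately."""
    n_features = len(activations[0])
    columns = ([row[j] for row in activations] for j in range(n_features))
    return sum(feature_entropy_bits(col, bin_width) for col in columns)


# Toy activations (rows = inputs, columns = features); made-up values for illustration only.
toy = [
    [0.0, 1.0],
    [0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]

# Feature 0 never fires (0 bits); feature 1 fires on half the inputs (1 bit).
print(description_length_bits(toy, bin_width=0.5))  # 1.0
```

Under this sketch, candidate SAEs would be compared at a similar reconstruction error, and the one with the smallest estimated description length would be preferred. A feature that is always off costs nothing, while a feature whose value is unpredictable costs more bits, so adding many rarely useful features is not free.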