Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide, Joshua Engels, Eric J. Michaud, Max Tegmark, Christian Schroeder de Witt
2024-10-11
Abstract: Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to scale them up to very high width, posing a computational challenge. In this work, we introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs. Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller "expert" SAEs, enabling SAEs to efficiently scale to many more features. We present experiments comparing Switch SAEs with other SAE architectures, and find that Switch SAEs deliver a substantial Pareto improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget. We also study the geometry of features across experts, analyze features duplicated across experts, and verify that Switch SAE features are as interpretable as features found by other SAE architectures.
Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the computational cost of training Sparse Autoencoders (SAEs) on large language models. Specifically:

1. **Computational Cost**: Current SAE architectures require substantial compute when scaled to frontier language models such as Claude 3 Sonnet and GPT-4. As models continue to grow, existing training methods will become increasingly unsustainable.
2. **Feature Extraction Efficiency**: SAEs decompose neural network activations into interpretable features. To identify all features represented in frontier models, however, SAEs must be scaled to very high widths, which poses a computational challenge.
3. **Proposed Approach**: The paper introduces a new SAE architecture, the Switch Sparse Autoencoder (Switch SAE), which reduces the compute required for training by routing input activations to smaller "expert" SAEs. This allows SAEs to scale more efficiently to a larger number of features.

In the paper's experiments, Switch SAEs deliver a substantial Pareto improvement in the reconstruction vs. sparsity frontier under a fixed training compute budget, and their features are as interpretable as those found by other SAE architectures.
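The routing idea can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It makes several assumptions that the summary above does not state: each activation goes to a single expert chosen by a softmax router, each expert uses a ReLU encoder with top-k sparsity, the reconstruction is scaled by the router probability, and bias terms are omitted. All names (`switch_sae_forward`, `router_W`, `experts`, `k`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def top_k(values, k):
    """Keep the k largest entries and zero the rest (ties go to the lower index)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def switch_sae_forward(x, router_W, experts, k):
    """Route x to one expert SAE and reconstruct it.

    experts is a list of (W_enc, W_dec) pairs, where W_enc has shape
    (latents_per_expert x d) and W_dec has shape (d x latents_per_expert).
    Assumption: biases are omitted and only the top-1 expert is evaluated.
    """
    probs = softmax(matvec(router_W, x))              # router distribution over experts
    e = max(range(len(probs)), key=probs.__getitem__)  # index of the chosen expert
    W_enc, W_dec = experts[e]
    pre = [max(0.0, a) for a in matvec(W_enc, x)]      # ReLU encoder, chosen expert only
    latents = top_k(pre, k)                            # at most k active latents
    # Assumption: scale by the router probability, as in Switch-style routing.
    recon = [probs[e] * r for r in matvec(W_dec, latents)]
    return e, latents, recon


if __name__ == "__main__":
    random.seed(0)
    d, n_experts, latents_per_expert, k = 4, 3, 8, 2

    def rand(rows, cols):
        return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

    router_W = rand(n_experts, d)
    experts = [(rand(latents_per_expert, d), rand(d, latents_per_expert))
               for _ in range(n_experts)]
    x = [random.gauss(0, 1) for _ in range(d)]
    expert_id, latents, recon = switch_sae_forward(x, router_W, experts, k)
    print("expert:", expert_id, "active latents:", sum(v != 0 for v in latents))
```

In this sketch, only the selected expert's encoder and decoder are evaluated for each input. The per-activation cost therefore scales with the size of one expert plus the router, not with the total number of latents across all experts.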