Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide, Joshua Engels, Eric J. Michaud, Max Tegmark, Christian Schroeder de Witt

2024-10-11

Abstract: Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to scale them up to very high width, posing a computational challenge. In this work, we introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs. Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller "expert" SAEs, enabling SAEs to efficiently scale to many more features. We present experiments comparing Switch SAEs with other SAE architectures, and find that Switch SAEs deliver a substantial Pareto improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget. We also study the geometry of features across experts, analyze features duplicated across experts, and verify that Switch SAE features are as interpretable as features found by other SAE architectures.

Machine Learning

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the computational cost of training Sparse Autoencoders (SAEs) on large language models. Specifically:

1. **Computational Cost Issue**: Current SAE architectures require significant compute when scaled to frontier language models such as Claude 3 Sonnet and GPT-4. As model size continues to grow, existing training methods will become increasingly unsustainable.
2. **Feature Extraction Efficiency**: SAEs decompose neural network activations into interpretable features. For SAEs to identify all features represented in frontier models, they must be scaled to very high widths, which poses a computational challenge.
3. **Optimization Objective**: The paper proposes a new SAE architecture, the Switch Sparse Autoencoder (Switch SAE), which reduces the compute needed to train SAEs by routing input activations to smaller "expert" SAEs. This allows SAEs to scale more efficiently to a larger number of features.

In the paper's experiments, Switch SAEs deliver a substantial Pareto improvement in the reconstruction vs. sparsity trade-off under a fixed training compute budget, and their features are as interpretable as those found by other SAE architectures.
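The routing idea described above can be illustrated with a minimal sketch of a forward pass. This is not the authors' implementation: the class name `SwitchSAE`, the top-1 softmax router, the TopK activation inside each expert, the scaling of the output by the router probability, and the omission of bias terms are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def top_k(acts, k):
    """Apply ReLU to the k largest pre-activations and zero out the rest."""
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    return [max(a, 0.0) if i in keep else 0.0 for i, a in enumerate(acts)]


class SwitchSAE:
    """Hypothetical sketch: a router sends each input to one small expert SAE."""

    def __init__(self, d_model, n_experts, feats_per_expert, k, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.k = k
        self.router = init(n_experts, d_model)  # one routing row per expert
        self.enc = [init(feats_per_expert, d_model) for _ in range(n_experts)]
        self.dec = [init(d_model, feats_per_expert) for _ in range(n_experts)]

    def forward(self, x):
        # Assumption: top-1 routing over softmax probabilities.
        probs = softmax(matvec(self.router, x))
        expert = max(range(len(probs)), key=probs.__getitem__)
        # Only the chosen expert's encoder and decoder are evaluated.
        latents = top_k(matvec(self.enc[expert], x), self.k)
        # Assumption: the reconstruction is weighted by the router probability.
        recon = [probs[expert] * v for v in matvec(self.dec[expert], latents)]
        return recon, expert, latents
```

In this sketch, with `d_model=8`, `n_experts=4`, and `feats_per_expert=16`, encoding one input costs 16 × 8 = 128 multiplications for the chosen expert plus 4 × 8 = 32 for the router, compared with 4 × 16 × 8 = 512 for a dense encoder of the same total width. That gap is the kind of saving the routing design is aiming for.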
