Abstract: Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a sparsifying activation function that implicitly defines a set of token-feature matches. We frame the token-feature matching as a resource allocation problem constrained by a total sparsity upper bound. For example, TopK SAEs solve this allocation problem with the additional constraint that each token matches with at most $k$ features. In TopK SAEs, the $k$ active features per token constraint is the same across tokens, despite some tokens being more difficult to reconstruct than others. To address this limitation, we propose two novel SAE variants, Feature Choice SAEs and Mutual Choice SAEs, which each allow for a variable number of active features per token. Feature Choice SAEs solve the sparsity allocation problem under the additional constraint that each feature matches with at most $m$ tokens. Mutual Choice SAEs solve the unrestricted allocation problem where the total sparsity budget can be allocated freely between tokens and features. Additionally, we introduce a new auxiliary loss function, $\mathtt{aux\_zipf\_loss}$, which generalises the $\mathtt{aux\_k\_loss}$ to mitigate dead and underutilised features. Our methods result in SAEs with fewer dead features and improved reconstruction loss at equivalent sparsity levels as a result of the inherent adaptive computation. More accurate and scalable feature extraction methods provide a path towards better understanding and more precise control of foundation models.
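
The three allocation rules described in the abstract can be compared on a toy score matrix. The following is a minimal sketch, not the authors' implementation: the function names (`token_choice`, `feature_choice`, `mutual_choice`), the plain nested-list representation, and the example scores are all assumptions made for illustration. Each rule spends the same total budget of four token-feature matches but distributes it differently.

```python
from typing import List

Matrix = List[List[float]]  # rows = tokens, columns = features
Mask = List[List[bool]]     # True where a token-feature match is kept


def empty_mask(scores: Matrix) -> Mask:
    return [[False] * len(row) for row in scores]


def token_choice(scores: Matrix, k: int) -> Mask:
    """TopK-style rule: every token keeps its k highest-scoring features."""
    mask = empty_mask(scores)
    for i, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for j in ranked[:k]:
            mask[i][j] = True
    return mask


def feature_choice(scores: Matrix, m: int) -> Mask:
    """Feature Choice-style rule: every feature keeps its m highest-scoring tokens."""
    mask = empty_mask(scores)
    for j in range(len(scores[0])):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i][j], reverse=True)
        for i in ranked[:m]:
            mask[i][j] = True
    return mask


def mutual_choice(scores: Matrix, budget: int) -> Mask:
    """Mutual Choice-style rule: keep the `budget` highest-scoring pairs overall."""
    mask = empty_mask(scores)
    pairs = [(s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)]
    for _, i, j in sorted(pairs, reverse=True)[:budget]:
        mask[i][j] = True
    return mask


def per_token(mask: Mask) -> List[int]:
    return [sum(row) for row in mask]


def per_feature(mask: Mask) -> List[int]:
    return [sum(col) for col in zip(*mask)]


# Hypothetical pre-activation scores for 4 tokens and 2 features.
scores = [[0.9, 0.8], [0.1, 0.2], [0.7, 0.3], [0.4, 0.25]]

tc = token_choice(scores, k=1)        # budget = 4 tokens * 1
fc = feature_choice(scores, m=2)      # budget = 2 features * 2
mc = mutual_choice(scores, budget=4)  # same total budget, no per-row or per-column cap

print("active features per token:", per_token(tc), per_token(fc), per_token(mc))
print("active tokens per feature:", per_feature(tc), per_feature(fc), per_feature(mc))
```

In this toy case the token-choice rule gives every token exactly one feature, while the other two rules leave the low-scoring second token with none and spend the freed budget elsewhere, which is the adaptive-computation behaviour the abstract refers to.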

What problem does this paper attempt to address?

The core problem this paper addresses is **how to improve the feature extraction ability and interpretability of Sparse Autoencoders (SAEs) in neural networks while reducing the number of "dead features"**. Specifically, the authors propose two new SAE variants, Feature Choice SAEs and Mutual Choice SAEs, which allocate sparse feature activations more flexibly through adaptive computation, thereby improving reconstruction accuracy and reducing unused features.

### Detailed Interpretation

1. **Limitations of Existing Methods**:
   - **TopK SAEs**: This existing method fixes the same number of active features ($k$) for every token, regardless of the fact that some tokens are harder to reconstruct than others. Easy tokens therefore receive more features than they need, while hard tokens receive too few.
   - **Dead Feature Problem**: Many existing SAE methods produce a large number of "dead features", that is, features that are almost never activated across the entire input dataset. This wastes model capacity and reduces training efficiency.
2. **Proposed New Methods**:
   - **Feature Choice SAEs**: Each feature matches with at most $m$ tokens, so that every feature can be activated in each batch, which avoids dead features.
   - **Mutual Choice SAEs**: The total sparsity budget is allocated freely between tokens and features, so the token-feature matching can adapt as needed, achieving adaptive computation.
3. **Auxiliary Loss Function**:
   - A new auxiliary loss function, `aux_zipf_loss`, generalises `aux_k_loss` to mitigate dead and underutilised features.
4. **Experimental Results**:
   - Experiments show that both new methods yield fewer dead features and lower reconstruction loss than existing methods at the same sparsity level.
   - In particular, Feature Choice SAEs achieve a 0% dead feature rate in large-scale models.
5. **Theoretical Contributions**:
   - The paper frames the token-feature matching defined by the sparsifying activation function as a resource allocation problem, providing a new perspective for understanding and optimising SAE design.
   - A phased training method (Mutual Choice training first, then Feature Choice training) is proposed, which may bring performance improvements.

### Formula Summary

- **Overall Loss** (reconstruction error plus sparsity and auxiliary terms):
  \[ L(x)=\|x - \hat{x}\|_2^2+\lambda_1 L_{\text{sparsity}}(z)+\lambda_2 L_{\text{aux}}(x, z, \hat{x}) \]
  where $\|x - \hat{x}\|_2^2$ is the reconstruction error, $L_{\text{sparsity}}$ is the sparsity loss term, and $L_{\text{aux}}$ is the auxiliary loss term.
- **Zipf Distribution Feature Density**:
  \[ m_i = \text{Zipf}(i)\propto\frac{1}{(i + \beta)^\alpha} \]
  where $m_i$ is the maximum number of activations of the $i$-th feature, and $\alpha$ and $\beta$ are hyperparameters.

Through these improvements, the paper contributes new ideas and techniques for the design of sparse autoencoders, particularly for large-scale neural networks.
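
To make the Zipf formula above concrete, the sketch below turns the proportionality $m_i \propto 1/(i+\beta)^\alpha$ into per-feature quotas that sum to a given budget. This is only an illustration under stated assumptions: the function name `zipf_quotas`, the 1-indexed feature ranks, and the normalisation to a fixed total budget are not taken from the paper, and the source does not say how the quotas are rounded or used inside `aux_zipf_loss`.

```python
from typing import List


def zipf_quotas(n_features: int, budget: float,
                alpha: float = 1.0, beta: float = 0.0) -> List[float]:
    """Split `budget` across ranked features in proportion to 1 / (i + beta) ** alpha.

    Assumption: features are ranked i = 1..n_features, most frequent first.
    """
    weights = [1.0 / (i + beta) ** alpha for i in range(1, n_features + 1)]
    total = sum(weights)
    return [budget * w / total for w in weights]


# Hypothetical example: 4 features sharing a budget of 100 activations.
quotas = zipf_quotas(n_features=4, budget=100, alpha=1.0, beta=0.0)
print([round(q, 2) for q in quotas])

# With alpha = 0 the weights are all equal, so the split is uniform.
uniform = zipf_quotas(n_features=4, budget=100, alpha=0.0)
print(uniform)
```

With $\alpha = 1$ and $\beta = 0$ the weights are $1, 1/2, 1/3, 1/4$, so the four quotas come out at roughly 48, 24, 16 and 12. Increasing $\alpha$ concentrates the budget on the top-ranked features, while increasing $\beta$ flattens the head of the distribution.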

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Efficient Dictionary Learning with Switch Sparse Autoencoders

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

SAC: Accelerating and Structuring Self-Attention Via Sparse Adaptive Connection.

BatchTopK Sparse Autoencoders

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Scaling and evaluating sparse autoencoders

Decomposing The Dark Matter of Sparse Autoencoders

Disentangling Dense Embeddings with Sparse Autoencoders

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Interpreting Attention Layer Outputs with Sparse Autoencoders

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Improving Dictionary Learning with Gated Sparse Autoencoders

Unveiling the Power of Sparse Neural Networks for Feature Selection

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Automatically Interpreting Millions of Features in Large Language Models