Abstract: Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against \emph{supervised} feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes.

### What problem does this paper attempt to solve?

This paper addresses the difficulty of evaluating sparse autoencoders (SAEs) for interpretability and control. The underlying goal is to decompose a model's activations into meaningful features and to verify that those features are valid. In realistic settings, however, there are no ground-truth labels for these features, which makes recent methods such as sparse dictionary learning hard to validate. To meet this challenge, the authors propose a framework that evaluates a feature dictionary on a specific task by comparing it with a supervised feature dictionary.

The evaluation covers three axes:

1. **Interpretability**: Verify whether SAEs capture interpretable features.
2. **Controllability**: Evaluate how well SAEs support editing internal representations, that is, whether the model's behavior can be precisely controlled by modifying a small number of features.
3. **Reconstruction quality**: Check how well SAEs reconstruct activations, and whether those reconstructions are sufficient and necessary for the task.

The authors also observe two qualitative phenomena in SAE training:

- **Feature occlusion**: A causally relevant concept is masked by features with higher magnitudes.
- **Feature over-splitting**: Binary features are split into multiple smaller features that are harder to interpret.

With this approach, the authors aim to provide a more objective and well-founded way to evaluate sparse dictionary learning methods.

### Main contributions

1. **Proposed a principled method** for computing sparse feature dictionaries of a language model on real-world tasks, using attributes that describe the prompts as supervision.
2. **Applied this method to the IOI task** and demonstrated that these dictionaries exhibit three desirable properties in the task context: sufficiency and necessity of reconstruction, sparse controllability, and consistent interpretability.
3. **Designed and contextualized evaluations of unsupervised feature dictionaries** along the same three axes, without assuming that the unsupervised dictionary uses the same concepts as the supervised dictionary.
4. **Applied these evaluations to SAEs trained on different datasets** and found that task-specific SAEs need fewer features changed to edit an attribute, but that neither type of SAE outperforms the supervised dictionary.
5. **Explored the qualitative phenomena in task-specific SAEs** in depth and reproduced them in a simple toy model, indicating that they may apply more broadly.

### Conclusion

This study emphasizes the need for more principled training and evaluation methods in this active field and shows that supervised feature dictionaries can be a valuable tool for automating certain aspects of the process.
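To make the idea of a supervised feature dictionary more concrete, here is a minimal sketch in Python. It assumes one simple construction that this summary does not confirm as the paper's exact method: each attribute value gets a feature vector equal to the mean of the centered activations that share that value. The data is synthetic, and the names `fit_supervised_dictionary`, `reconstruct`, and `edit_attribute` are hypothetical. The sketch illustrates two of the three axes, reconstruction and sparse control.

```python
import numpy as np

def fit_supervised_dictionary(acts, attrs):
    """Assumed construction: one feature per (attribute, value) pair,
    taken as the conditional mean of the centered activations."""
    mean = acts.mean(axis=0)
    centered = acts - mean
    features = {}
    for j in range(attrs.shape[1]):
        for v in np.unique(attrs[:, j]):
            features[(j, int(v))] = centered[attrs[:, j] == v].mean(axis=0)
    return mean, features

def reconstruct(mean, features, attr_row):
    """Approximate an activation as the mean plus one feature per attribute."""
    return mean + sum(features[(j, int(v))] for j, v in enumerate(attr_row))

def edit_attribute(act, features, j, old, new):
    """Sparse control: swap a single feature to change attribute j."""
    return act - features[(j, old)] + features[(j, new)]

# Synthetic stand-in for model activations: two attributes with three
# values each, combined additively with a small amount of noise.
rng = np.random.default_rng(0)
d = 16
grid = np.array([(a, b) for a in range(3) for b in range(3)])
attrs = np.tile(grid, (50, 1))                      # balanced, shape (450, 2)
true = rng.normal(size=(2, 3, d))                   # hidden attribute directions
base = rng.normal(size=d)
acts = (base + true[0, attrs[:, 0]] + true[1, attrs[:, 1]]
        + 0.01 * rng.normal(size=(len(attrs), d)))

mean, features = fit_supervised_dictionary(acts, attrs)

# Reconstruction: the supervised features should approximate the activation.
recon = reconstruct(mean, features, attrs[0])
print("reconstruction error:", np.linalg.norm(recon - acts[0]))

# Control: change attribute 0 from its current value to 2 with one feature swap.
edited = edit_attribute(acts[0], features, 0, int(attrs[0, 0]), 2)
target = base + true[0, 2] + true[1, attrs[0, 1]]   # noise-free counterfactual
print("edit error:", np.linalg.norm(edited - target))
```

On this toy data both errors are small because the activations are additive in the attributes by construction. Real model activations need not have this structure, which is why reconstruction and control have to be evaluated empirically rather than assumed.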

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
