Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, George Lange, Neel Nanda
2024-05-21
Abstract: Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against *supervised* feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes.
Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the difficulty of evaluating sparse autoencoders (SAEs) for interpretability and control. The underlying question is how to decompose a model's activations into meaningful features and how to verify that those features are useful. In realistic settings there is no ground truth for the features, so recent methods such as sparse dictionary learning are hard to validate. To meet this challenge, the authors propose a framework that evaluates a feature dictionary on a specific task by comparing it against a supervised feature dictionary.

The evaluation covers three axes:

1. **Interpretability**: Verify whether SAEs capture interpretable features.
2. **Controllability**: Evaluate how well SAEs support editing of internal representations, that is, whether the model's behavior can be controlled precisely by modifying a small number of features.
3. **Reconstruction quality**: Check how well SAEs reconstruct activations, and whether the reconstructions are both sufficient and necessary for the model's computation on the task.

The authors also observe two qualitative phenomena in SAE training:

- **Feature occlusion**: Causally relevant concepts are overshadowed by features of higher magnitude.
- **Feature over-splitting**: Binary features are split into multiple smaller features that are harder to interpret.

With this approach, the authors aim to give sparse dictionary learning methods a more objective and well-founded evaluation procedure.

### Main contributions

1. **Proposed a principled method** for computing sparse feature dictionaries for a language model on a real-world task, using attributes that describe the prompts as supervision.
2. **Applied this method to the indirect object identification (IOI) task** and demonstrated that the resulting dictionaries have three desirable properties in the context of the task: sufficiency and necessity of reconstructions, sparse controllability, and consistency of interpretation.
3. **Designed and contextualized evaluations of unsupervised feature dictionaries** along the same three axes, without assuming that the unsupervised dictionary uses the same concepts as the supervised one.
4. **Applied these evaluations to SAEs trained on different datasets** and found that task-specific SAEs require fewer feature changes to edit an attribute, but that neither kind of SAE outperforms the supervised dictionaries.
5. **Explored the qualitative phenomena in task-specific SAEs in depth** and reproduced them in a simple toy model, which suggests that they may occur more broadly.

### Conclusion

This study emphasizes the need for more principled training and evaluation methods in this active field, and shows that supervised feature dictionaries can be a valuable tool for automating certain aspects of the process.
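
To make the supervised-dictionary idea in the contributions above more concrete, here is a minimal sketch in Python with NumPy. This summary does not say how the supervised dictionaries are computed, so the construction below is an assumption: each attribute value gets one feature vector, defined as the mean of the centered activations over all prompts with that value. The function names (`supervised_dictionary`, `reconstruct`, `edit`, `demo`) and the synthetic data are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def supervised_dictionary(acts, attrs):
    """Build one feature vector per (attribute, value) pair.

    acts:  (n, d) array of activations, one row per prompt.
    attrs: (n, k) integer array; attrs[i, j] is the value of attribute j
           for prompt i.
    Assumption: a feature is the conditional mean of the centered activations.
    """
    mean = acts.mean(axis=0)
    centered = acts - mean
    features = []
    for j in range(attrs.shape[1]):
        features.append({
            int(v): centered[attrs[:, j] == v].mean(axis=0)
            for v in np.unique(attrs[:, j])
        })
    return mean, features


def reconstruct(mean, features, attr_row):
    """Approximate an activation from the attribute values of its prompt."""
    return mean + sum(features[j][int(v)] for j, v in enumerate(attr_row))


def edit(act, features, j, old, new):
    """Change attribute j from `old` to `new` by swapping a single feature."""
    return act - features[j][old] + features[j][new]


def demo(seed=0):
    """Synthetic check: activations are an exact sum of per-attribute vectors."""
    rng = np.random.default_rng(seed)
    d, sizes = 8, (3, 4)
    true = [rng.normal(size=(s, d)) for s in sizes]
    base = rng.normal(size=d)
    # Full factorial design: every combination of attribute values occurs once.
    attrs = np.array([(a, b) for a in range(sizes[0]) for b in range(sizes[1])])
    acts = base + true[0][attrs[:, 0]] + true[1][attrs[:, 1]]

    mean, feats = supervised_dictionary(acts, attrs)
    recon = np.stack([reconstruct(mean, feats, row) for row in attrs])
    # Fraction of variance explained by the reconstruction.
    fve = 1 - ((acts - recon) ** 2).sum() / ((acts - acts.mean(0)) ** 2).sum()

    # Edit attribute 0 of the first prompt from value 0 to value 2 and compare
    # with the activation of the prompt that actually has those attributes.
    edited = edit(acts[0], feats, j=0, old=0, new=2)
    target = acts[(attrs[:, 0] == 2) & (attrs[:, 1] == attrs[0, 1])][0]
    return fve, edited, target


if __name__ == "__main__":
    fve, edited, target = demo()
    print("fraction of variance explained:", fve)
    print("max edit error:", np.abs(edited - target).max())
```

On this idealized data the reconstruction and the edit are exact up to floating-point error, because the activations were built to be additive in the attributes. Real model activations will not be perfectly additive, which is why reconstruction and edit quality have to be measured rather than assumed.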