Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

2024-05-24

Abstract: Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been used to identify these features. However, SAEs may learn more about the structure of the dataset than the computational structure of the network. There is therefore only indirect reason to believe that the directions found in these dictionaries are functionally important to the network. We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features. E2e dictionary learning brings us closer to methods that can explain network behavior concisely and accurately. We release our library for training e2e SAEs and reproducing our analysis at
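To make "a sparse, overcomplete dictionary that reconstructs a network's internal activations" concrete, here is a minimal sketch of an SAE forward pass in plain Python. It is illustrative only and is not the authors' released library: the ReLU encoder, the untied encoder and decoder weights, the random initial values, and all names (`sae_forward`, `dictionary`, `l0_norm`, and so on) are assumptions made for this example.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sae_forward(activation, w_enc, b_enc, dictionary, b_dec):
    """Encode one activation vector into feature activations, then decode it.

    w_enc and dictionary both have shape (n_features, d_model); each row of
    `dictionary` is one learned direction. The ReLU encoder is an assumption.
    """
    pre_acts = matvec(w_enc, activation)
    feature_acts = [max(0.0, z + b) for z, b in zip(pre_acts, b_enc)]
    # Reconstruction is the decoder bias plus a weighted sum of the
    # dictionary directions whose features are active.
    reconstruction = list(b_dec)
    for coeff, direction in zip(feature_acts, dictionary):
        if coeff > 0.0:
            for i, d in enumerate(direction):
                reconstruction[i] += coeff * d
    return feature_acts, reconstruction


def reconstruction_mse(activation, reconstruction):
    """Mean squared error, the quantity a standard SAE is trained to reduce."""
    return sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)


def l0_norm(feature_acts):
    """Number of simultaneously active features for this datapoint."""
    return sum(1 for c in feature_acts if c > 0.0)


# Overcomplete: more dictionary features (8) than activation dimensions (4).
random.seed(0)
d_model, n_features = 4, 8
w_enc = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [-0.2] * n_features
dictionary = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
b_dec = [0.0] * d_model

activation = [0.3, -1.2, 0.8, 0.1]
feature_acts, reconstruction = sae_forward(activation, w_enc, b_enc, dictionary, b_dec)
print("active features:", l0_norm(feature_acts), "of", n_features)
print("reconstruction MSE:", reconstruction_mse(activation, reconstruction))
```

With untrained random weights the reconstruction error is arbitrary; the sketch only shows which quantities are involved. The count of active features and the total dictionary size correspond to the two feature counts the abstract compares.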

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses a core problem in mechanistic interpretability: identifying which features learned by a neural network are functionally important to its computation. Standard sparse autoencoders (SAEs) are trained to reconstruct a network's internal activations, so they may capture the structure of the dataset more than the computational structure of the network. As a result, there is only indirect evidence that the directions they find matter to the network's behavior.

To address this, the authors propose end-to-end (e2e) sparse dictionary learning, which trains the SAE by minimizing the Kullback-Leibler (KL) divergence between the output distribution of the original model and that of the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: they explain more of the network's performance, require fewer total features, and require fewer simultaneously active features per data point, without compromising interpretability. The paper also explores geometric and qualitative differences between e2e SAE features and standard SAE features. In automated interpretability and qualitative analysis, e2e SAE features are reported to be at least as interpretable as standard SAE features.
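The training signal described above, a KL divergence between two output distributions, can be sketched in a few lines of plain Python. This is a hedged illustration and not the paper's implementation: the direction of the KL term (original model first), the added L1 sparsity penalty, and the names `kl_divergence`, `e2e_loss`, and `sparsity_coeff` are all assumptions for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_divergence(logits_orig, logits_sae):
    """KL(p_orig || p_sae) for a single output distribution.

    logits_orig: logits from the unmodified model.
    logits_sae:  logits from the model with SAE activations inserted.
    The direction of the divergence is an assumption of this sketch.
    """
    log_p = log_softmax(logits_orig)
    log_q = log_softmax(logits_sae)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def e2e_loss(logits_orig, logits_sae, feature_acts, sparsity_coeff):
    """KL term plus an (assumed) L1 penalty on the SAE feature activations."""
    l1_penalty = sum(abs(c) for c in feature_acts)
    return kl_divergence(logits_orig, logits_sae) + sparsity_coeff * l1_penalty


# Hypothetical logits over a 3-token vocabulary and a 3-feature SAE.
logits_orig = [2.0, 0.5, -1.0]
logits_sae = [1.8, 0.7, -0.9]
feature_acts = [0.0, 1.3, 0.0]
print("KL:", kl_divergence(logits_orig, logits_sae))
print("loss:", e2e_loss(logits_orig, logits_sae, feature_acts, sparsity_coeff=0.01))
```

The contrast with a standard SAE is in what gets compared: the loss here depends on the model's outputs rather than on how closely the SAE reproduces the intermediate activations, so dictionary directions that do not affect the output receive no reward.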

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Improving Dictionary Learning with Gated Sparse Autoencoders

Efficient Dictionary Learning with Switch Sparse Autoencoders

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Analyzing (In)Abilities of SAEs via Formal Languages

Disentangling Dense Embeddings with Sparse Autoencoders

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

Decomposing The Dark Matter of Sparse Autoencoders

Interpreting Attention Layer Outputs with Sparse Autoencoders

Kernel Regularized Nonlinear Dictionary Learning for Sparse Coding

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Automatically Interpreting Millions of Features in Large Language Models

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

Sparse-Coding Variational Auto-Encoders

Sparse-Coding Variational Autoencoders

The Interpretable Dictionary in Sparse Coding