Abstract: A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. In this paper, we investigate sparse inference and learning in SAEs through the lens of sparse coding. Specifically, we show that SAEs perform amortised sparse inference with a computationally restricted encoder and, using compressed sensing theory, we prove that this mapping is inherently insufficient for accurate sparse inference, even in solvable cases. Building on this theory, we empirically explore conditions where more sophisticated sparse inference methods outperform traditional SAE encoders. Our key contribution is the decoupling of the encoding and decoding processes, which allows for a comparison of various sparse encoding strategies. We evaluate these strategies on two dimensions: alignment with true underlying sparse features and correct inference of sparse codes, while also accounting for computational costs during training and inference. Our results reveal that substantial performance gains can be achieved with minimal increases in compute cost. We demonstrate that this generalises to SAEs applied to large language models (LLMs), where advanced encoders achieve similar interpretability. This work opens new avenues for understanding neural network representations and offers important implications for improving the tools we use to analyse the activations of large language models.

What problem does this paper attempt to address?

The problem this paper addresses is the limited ability of sparse autoencoders (SAEs) to perform sparse inference. Specifically, the authors point out that although SAEs can extract interpretable features from neural network representations, their simple linear-nonlinear encoding mechanism limits the accuracy of sparse inference. Because the encoder is computationally restricted, SAEs exhibit an "amortisation gap": the difference between the sparse code predicted by the SAE encoder and the optimal sparse code that an unconstrained sparse inference algorithm could produce.

### Main problem summary:

1. **Accuracy of sparse inference**: The simple encoding mechanism of SAEs is insufficient for accurate sparse inference, even in cases that are solvable in principle.
2. **Amortisation gap**: Because the encoder is computationally restricted, the SAE cannot reach the optimal sparse code, which results in an amortisation gap.
3. **Optimizing sparse inference methods**: The paper explores whether more sophisticated sparse inference methods can outperform the traditional SAE encoder while keeping the additional computational cost small.

### Core contributions of the paper:

- **Decoupling the encoding and decoding processes**: By separating the encoding and decoding processes, the authors can compare different sparse encoding strategies and evaluate how well each aligns with the true underlying sparse features and how correctly it infers sparse codes.
- **Experimental verification**: The authors ran experiments on synthetic datasets and on a practical application (the activations of the large language model GPT-2), showing that more sophisticated methods can substantially improve performance with a relatively small increase in computational cost.
- **Theoretical analysis**: The authors use compressed sensing theory to prove the inherent limitations of the SAE encoder in sparse inference and propose directions for improvement.

Through these studies, the paper provides a new perspective for understanding and improving the tools used to analyse neural network representations, particularly the activations of large language models.
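To make the amortisation gap concrete, the following is a minimal NumPy sketch, not the paper's implementation. It compares a one-step linear-nonlinear encoder with an iterative sparse inference method (ISTA, swapped in here as a standard example of a more sophisticated encoder) on a fixed, known dictionary. Everything in it is an assumption made for illustration: the dimensions, the penalty `lam`, the L1-penalised objective used to score codes, and the untrained encoder whose weights are simply tied to the decoder transpose. A trained SAE would learn its encoder weights and bias.

```python
import numpy as np

def relu_encoder(x, W, b):
    """One-step linear-nonlinear encoder: z = ReLU(W x + b)."""
    return np.maximum(W @ x + b, 0.0)

def objective(x, D, z, lam):
    """Sparse coding objective: 0.5 * ||x - D z||^2 + lam * ||z||_1."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(x, D, z0, lam, n_steps=200):
    """Non-negative ISTA, warm-started at z0.

    With step size 1/L (L = squared spectral norm of D), each proximal
    gradient step cannot increase the objective, so the result is at
    least as good as the starting code z0.
    """
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    z = z0.copy()
    for _ in range(n_steps):
        grad = D.T @ (D @ z - x)
        z = np.maximum(z - grad / L - lam / L, 0.0)  # non-negative soft threshold
    return z

# Hypothetical synthetic setup: a random unit-norm dictionary and a sparse code.
rng = np.random.default_rng(0)
n_obs, n_latent, k = 20, 50, 3
D = rng.normal(size=(n_obs, n_latent))
D /= np.linalg.norm(D, axis=0)

z_true = np.zeros(n_latent)
support = rng.choice(n_latent, size=k, replace=False)
z_true[support] = rng.uniform(1.0, 2.0, size=k)
x = D @ z_true

lam = 0.05
# Assumption: untrained encoder tied to the decoder, with bias -lam.
z_sae = relu_encoder(x, D.T, -lam * np.ones(n_latent))
# Decoder D is held fixed; only the inference procedure changes.
z_ista = ista(x, D, z_sae, lam)

gap = objective(x, D, z_sae, lam) - objective(x, D, z_ista, lam)
print(f"objective (one-step encoder): {objective(x, D, z_sae, lam):.4f}")
print(f"objective (ISTA refinement):  {objective(x, D, z_ista, lam):.4f}")
print(f"gap in objective value:       {gap:.4f}")
```

Because the decoder `D` is shared and only the encoder changes, this mirrors the decoupling idea described above: any encoding strategy can be scored against the same dictionary. The difference in objective value is just one possible way to quantify the gap; comparing the inferred codes against `z_true` directly is another.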

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

Efficient Dictionary Learning with Switch Sparse Autoencoders

Decomposing The Dark Matter of Sparse Autoencoders

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Disentangling Dense Embeddings with Sparse Autoencoders

Analyzing (In)Abilities of SAEs via Formal Languages

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Improving Dictionary Learning with Gated Sparse Autoencoders

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Sparse-Coding Variational Auto-Encoders

Sparse-Coding Variational Autoencoders

Interpreting Attention Layer Outputs with Sparse Autoencoders

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Automatically Interpreting Millions of Features in Large Language Models

Scaling and evaluating sparse autoencoders

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Can sparse autoencoders make sense of latent representations?