Abstract: A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. In this paper, we investigate sparse inference and learning in SAEs through the lens of sparse coding. Specifically, we show that SAEs perform amortised sparse inference with a computationally restricted encoder and, using compressed sensing theory, we prove that this mapping is inherently insufficient for accurate sparse inference, even in solvable cases. Building on this theory, we empirically explore conditions where more sophisticated sparse inference methods outperform traditional SAE encoders. Our key contribution is the decoupling of the encoding and decoding processes, which allows for a comparison of various sparse encoding strategies. We evaluate these strategies on two dimensions: alignment with true underlying sparse features and correct inference of sparse codes, while also accounting for computational costs during training and inference. Our results reveal that substantial performance gains can be achieved with minimal increases in compute cost. We demonstrate that this generalises to SAEs applied to large language models (LLMs), where advanced encoders achieve similar interpretability. This work opens new avenues for understanding neural network representations and offers important implications for improving the tools we use to analyse the activations of large language models.
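To make the contrast between a linear-nonlinear encoder and iterative sparse inference concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes a fixed random dictionary `D` (nothing is learned), non-negative codes, and an L1-penalised least-squares objective, and all function and variable names are hypothetical. It relies on a standard observation: one proximal-gradient (ISTA) step started from zero has the same linear-then-ReLU form as an SAE encoder, and further steps with step size 1/L never increase the objective.

```python
import numpy as np


def relu_encoder(x, W, b):
    """Single linear-nonlinear map, the form used by a standard SAE encoder."""
    return np.maximum(W @ x + b, 0.0)


def lasso_objective(x, D, z, lam):
    """0.5 * ||x - D z||^2 + lam * ||z||_1 (the sparse coding objective assumed here)."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))


def ista_nonneg(x, D, lam, n_steps):
    """Non-negative ISTA: a gradient step followed by a shifted ReLU (the prox)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / L, with L the squared spectral norm of D
    z = np.zeros(D.shape[1])
    for _ in range(n_steps):
        grad = D.T @ (D @ z - x)
        z = np.maximum(z - step * grad - step * lam, 0.0)
    return z


# Synthetic problem (illustrative sizes, not taken from the paper).
rng = np.random.default_rng(0)
n_inputs, n_latents, n_active = 20, 50, 3
D = rng.normal(size=(n_inputs, n_latents))
D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm dictionary columns

z_true = np.zeros(n_latents)
support = rng.choice(n_latents, size=n_active, replace=False)
z_true[support] = rng.uniform(1.0, 2.0, size=n_active)
x = D @ z_true

lam = 0.05
step = 1.0 / np.linalg.norm(D, 2) ** 2

# Amortised guess: a linear-ReLU encoder whose weights happen to equal one ISTA step.
z_amortised = relu_encoder(x, step * D.T, np.full(n_latents, -step * lam))
z_first_step = ista_nonneg(x, D, lam, n_steps=1)

# Iterative inference: many ISTA steps on the same input.
z_ista = ista_nonneg(x, D, lam, n_steps=500)

obj_amortised = lasso_objective(x, D, z_amortised, lam)
obj_ista = lasso_objective(x, D, z_ista, lam)
print("objective, one linear-ReLU pass:", obj_amortised)
print("objective, 500 ISTA steps:     ", obj_ista)
```

A trained SAE encoder learns its own weights instead of copying `step * D.T`, so this sketch only illustrates why a single pass is a computationally restricted form of inference. It does not reproduce the paper's comparison of encoders.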

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
