Abstract: Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models, demonstrating their effectiveness in disentangling semantic concepts. By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, we show that the resulting sparse representations maintain semantic fidelity while offering interpretability. We analyse these learned features, exploring their behaviour across different model capacities and introducing a novel method for identifying "feature families" that represent related concepts at varying levels of abstraction. To demonstrate the practical utility of our approach, we show how these interpretable features can be used to precisely steer semantic search, allowing for fine-grained control over query semantics. This work bridges the gap between the semantic richness of dense embeddings and the interpretability of sparse representations. We open source our embeddings, trained sparse autoencoders, and interpreted features, as well as a web app for exploring them.

What problem does this paper attempt to address?

### Problems Attempted to Solve by the Paper

The paper primarily attempts to address the following issues:

1. **Interpretability of Dense Embedding Representations**:
   - Dense text embeddings generated by large language models (such as BERT and GPT) perform well in natural language processing tasks, but these high-dimensional continuous vectors are hard to interpret: it is unclear which specific semantics any given direction encodes. This makes interpretability and fine-grained control difficult in practical applications.
2. **Precise Control of Semantic Search**:
   - Dense embeddings are widely used in information retrieval and semantic search, but their opacity limits fine-grained adjustment of search results. For example, in academic literature retrieval, users may wish to control the query semantics precisely, which is difficult to achieve with existing methods.

### Main Contributions of the Paper

1. **Application of Sparse Autoencoders**:
   - In one of the first such applications, Sparse Autoencoders (SAEs) are applied to dense text embeddings generated by large language models, demonstrating the effectiveness of this approach in disentangling semantic concepts.
2. **Feature Interpretability Analysis**:
   - By training SAEs on dense embeddings of abstracts from computer science and astronomy papers, the paper demonstrates that sparse representations provide interpretability while maintaining semantic fidelity. It also analyses the features learned at different model capacities to explore their behavior and semantic properties.
3. **Concept of Feature Families**:
   - The paper introduces "feature families", groups of related features that represent a concept at varying levels of abstraction, which allow for multi-scale semantic analysis and manipulation. Analyzing how features "split" across levels of abstraction gives a more flexible view of the semantic space.
4. **Application in Enhanced Semantic Search**:
   - The paper demonstrates how these interpretable features can be used to precisely steer semantic search, achieving fine-grained control over query semantics. An open-source tool implementing an SAE-based semantic search system was developed, and the related models and features were released as open source.

Through the above research, the paper bridges the gap between the semantic richness of dense embedding representations and the interpretability of sparse representations, providing new possibilities for understanding and manipulating the semantic space of text.
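To make the steering idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a top-k sparsity rule, randomly initialised weights in place of a trained SAE, and a toy embedding size; the names `encode`, `decode`, `steer`, `W_enc`, `W_dec`, and `K` are all hypothetical. It illustrates how a dense query embedding might be mapped to a sparse code, how one feature's activation could be overwritten, and how the result would be decoded back into embedding space for search.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def encode(x, W_enc, b_enc, k):
    """Sparse code: ReLU pre-activations with only the k largest kept (assumed rule)."""
    pre = [max(0.0, a + b) for a, b in zip(matvec(W_enc, x), b_enc)]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [pre[i] if i in keep else 0.0 for i in range(len(pre))]

def decode(z, W_dec, b_dec):
    """Reconstruct a dense embedding; column j of W_dec is feature j's direction."""
    return [a + b for a, b in zip(matvec(W_dec, z), b_dec)]

def steer(x, feature, value, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, overwrite one feature's activation, and decode the edited code."""
    z = encode(x, W_enc, b_enc, k)
    z[feature] = value
    return decode(z, W_dec, b_dec)

# Toy sizes and random weights stand in for a trained SAE (illustrative only).
D, N, K = 8, 32, 4            # embedding dim, number of SAE features, active features
rng = random.Random(0)
W_enc = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]   # shape (N, D)
W_dec = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(D)]   # shape (D, N)
b_enc = [0.0] * N
b_dec = [0.0] * D

query = [rng.gauss(0, 1) for _ in range(D)]   # stand-in for a dense query embedding
steered_query = steer(query, feature=3, value=5.0,
                      W_enc=W_enc, b_enc=b_enc, W_dec=W_dec, b_dec=b_dec, k=K)
```

In this sketch, `steered_query` differs from the unsteered reconstruction only along the decoder direction of feature 3, so a nearest-neighbour search with it would be pulled toward documents that express that feature. With a trained SAE, the feature index would be chosen from the interpreted feature labels rather than picked arbitrarily.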

Disentangling Dense Embeddings with Sparse Autoencoders

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

Analyzing (In)Abilities of SAEs via Formal Languages

Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Decomposing The Dark Matter of Sparse Autoencoders

Interpreting Attention Layer Outputs with Sparse Autoencoders

Automatically Interpreting Millions of Features in Large Language Models

Can sparse autoencoders make sense of latent representations?

SPINE: SParse Interpretable Neural Embeddings

Efficient Dictionary Learning with Switch Sparse Autoencoders

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

Scaling and evaluating sparse autoencoders

An Exploration Of Semantic Relations In Neural Word Embeddings Using Extrinsic Knowledge

Improving Dictionary Learning with Gated Sparse Autoencoders