Disentangling Dense Embeddings with Sparse Autoencoders

Charles O'Neill, Christine Ye, Kartheik Iyer, John F. Wu
2024-08-05
Abstract: Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models, demonstrating their effectiveness in disentangling semantic concepts. By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, we show that the resulting sparse representations maintain semantic fidelity while offering interpretability. We analyse these learned features, exploring their behaviour across different model capacities and introducing a novel method for identifying "feature families" that represent related concepts at varying levels of abstraction. To demonstrate the practical utility of our approach, we show how these interpretable features can be used to precisely steer semantic search, allowing for fine-grained control over query semantics. This work bridges the gap between the semantic richness of dense embeddings and the interpretability of sparse representations. We open source our embeddings, trained sparse autoencoders, and interpreted features, as well as a web app for exploring them.
Machine Learning
What problem does this paper attempt to address?
### Problems Attempted to Solve by the Paper

The paper primarily addresses the following issues:

1. **Interpretability of Dense Embedding Representations**:
   - Dense text embeddings generated by large language models (such as BERT and GPT) perform well in natural language processing tasks, but these high-dimensional continuous vectors do not expose which semantic concepts they encode. This makes them hard to interpret and hard to control at a fine-grained level in practical applications.

2. **Precise Control of Semantic Search**:
   - Dense embeddings are widely used in information retrieval and semantic search, but their opacity limits how precisely search results can be adjusted. For example, in academic literature retrieval, users may wish to control the query semantics precisely, which is difficult to achieve with existing methods.

### Main Contributions of the Paper

1. **Application of Sparse Autoencoders**:
   - The paper presents one of the first applications of sparse autoencoders (SAEs) to dense text embeddings from large language models, and demonstrates their effectiveness in disentangling semantic concepts.

2. **Feature Interpretability Analysis**:
   - By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, the paper shows that the resulting sparse representations maintain semantic fidelity while offering interpretability. It also analyses how the learned features behave across different model capacities.

3. **Concept of Feature Families**:
   - The paper introduces a method for identifying "feature families", groups of related features that represent related concepts at varying levels of abstraction. Analysing how features "split" across these levels supports multi-scale semantic analysis and manipulation.

4. **Application in Enhanced Semantic Search**:
   - The paper demonstrates how these interpretable features can be used to precisely steer semantic search, giving fine-grained control over query semantics. The authors open-source their embeddings, trained sparse autoencoders, and interpreted features, along with a web app for exploring them.

Through this research, the paper bridges the gap between the semantic richness of dense embeddings and the interpretability of sparse representations, providing new possibilities for understanding and manipulating the semantic space of text.
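
The idea of steering semantic search through SAE features can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a top-k activation (the summary above does not specify the sparsity mechanism), uses randomly initialised weights in place of a trained SAE, and uses random vectors in place of real abstract embeddings. The names `encode`, `decode`, `steer`, and `search`, the toy sizes, and the feature index are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, n_features, k = 16, 64, 4  # toy sizes, chosen for illustration only

# Randomly initialised weights stand in for a trained SAE (assumption).
W_enc = rng.normal(scale=0.1, size=(d_embed, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_embed))
b_dec = np.zeros(d_embed)


def encode(x, k=k):
    """Map a dense embedding to sparse feature activations (top-k assumed)."""
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]  # indices of the k largest activations
    z[top] = pre[top]
    return z


def decode(z):
    """Reconstruct a dense embedding from sparse feature activations."""
    return z @ W_dec + b_dec


def steer(x, feature_idx, value):
    """Set one feature's activation, then decode back to embedding space."""
    z = encode(x)
    z[feature_idx] = value
    return decode(z)


def search(query, corpus, top_n=3):
    """Rank corpus rows by cosine similarity to the query embedding."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:top_n]


corpus = rng.normal(size=(100, d_embed))  # stand-in for abstract embeddings
query = rng.normal(size=d_embed)          # stand-in for a query embedding

baseline = search(decode(encode(query)), corpus)
steered = search(steer(query, feature_idx=7, value=5.0), corpus)
```

Because the decoder is linear, raising one feature's activation shifts the reconstructed query along that feature's decoder row `W_dec[feature_idx]`. In a trained SAE with interpreted features, that shift would correspond to adding or removing a named concept from the query before retrieval.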