Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu
2024-10-28
Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at https://github.com/OpenMOSS/Language-Model-SAEs. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
Machine Learning, Computation and Language