Applying sparse autoencoders to unlearn knowledge in language models

Eoin Farrell, Yeu-Tong Lau, Arthur Conmy
2024-10-25
Abstract: We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from language models. We use the biology subset of the Weapons of Mass Destruction Proxy dataset and test on the gemma-2b-it and gemma-2-2b-it language models. We demonstrate that individual interpretable biology-related SAE features can be used to unlearn biology-related knowledge with minimal side-effects. Our results suggest that negative scaling of feature activations is necessary and that zero ablating features is ineffective. We find that intervening using multiple SAE features simultaneously can unlearn multiple different topics, but with similar or larger unwanted side-effects than the existing Representation Misdirection for Unlearning technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to the existing fine-tuning based techniques.
Machine Learning, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper explores whether sparse autoencoders (SAEs) can be used to remove specific knowledge from language models. The authors use the biology subset of the Weapons of Mass Destruction Proxy dataset and test on two language models, gemma-2b-it and gemma-2-2b-it. The aim is an interpretable way to "unlearn" information related to biological weapons, so that models do not carry dangerous capabilities or inaccurate information into deployment.

#### Main research objectives:

1. **Verify the applicability of SAEs**: Determine whether sparse autoencoders can serve as an interpretable method for unlearning knowledge.
2. **Evaluate the intervention effect**: Adjust SAE feature activations and measure the impact on model performance, in particular how well harmful knowledge is removed and what side effects result.
3. **Compare with existing methods**: Compare SAE-based unlearning with the existing Representation Misdirection for Unlearning (RMU) technique and analyze their respective advantages and disadvantages.

#### Research background and motivation:

Language models may learn inaccurate information, produce toxic outputs, or acquire potentially dangerous capabilities (such as advanced biological weapons knowledge). Accurate and robust methods for removing these undesirable capabilities or information are therefore needed before such models are deployed. It remains unclear, however, how to remove specific knowledge or capabilities from a language model precisely and reliably.

#### Method overview:

The authors use sparse autoencoders, which learn a sparse reconstruction of language model activations, and attempt to unlearn knowledge by intervening on the resulting feature activations. Specifically, they select SAE features related to biology and intervene by scaling their activations to negative values. They also evaluate different intervention strategies, including different feature scalings and simultaneous intervention on multiple features.

#### Key findings:

1. **Effectiveness of a single feature**: Individual interpretable biology-related SAE features can unlearn biology-related knowledge with minimal side effects. Zero-ablating a feature is ineffective; the results suggest that negative scaling of the feature activation is necessary.
2. **Challenges of multi-feature intervention**: Intervening on multiple features simultaneously can unlearn multiple different topics, but with larger side effects.
3. **Comparison with existing methods**: Multi-feature SAE interventions produce similar or larger unwanted side effects than RMU. Current SAE quality or intervention techniques would need to improve for SAE-based unlearning to be comparable to existing fine-tuning-based techniques.

In conclusion, this research offers a starting point for more transparent and verifiable knowledge removal methods, but considerable further work is needed.
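
To make the difference between zero ablation and negative scaling concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the two-feature SAE, the weights, the function names (`encode`, `intervene`), and the default scale of `-2.0` are all hypothetical, and the exact form of the paper's intervention may differ. The sketch assumes a ReLU encoder and edits the activation vector by adding the change in one feature's activation along that feature's decoder direction.

```python
def encode(x, w_enc, b_enc):
    """SAE feature activations f = ReLU(W_enc x + b_enc); one row of w_enc per feature."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def intervene(x, w_enc, b_enc, w_dec, feature, mode, scale=-2.0):
    """Edit one feature's contribution to the activation vector x.

    mode="zero":     set the feature's activation to 0 (zero ablation).
    mode="negative": multiply the feature's activation by `scale` (assumed < 0).
    Inputs on which the feature is inactive are returned unchanged.
    """
    old = encode(x, w_enc, b_enc)[feature]
    if old <= 0.0:
        return list(x)
    new = 0.0 if mode == "zero" else scale * old
    direction = w_dec[feature]  # decoder direction of the edited feature
    return [xi + (new - old) * d for xi, d in zip(x, direction)]


# Hypothetical toy SAE: feature 0 stands in for a "biology" feature.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]

x_bio = [3.0, 1.0]     # feature 0 is active on this input
x_other = [-1.0, 1.0]  # feature 0 is inactive on this input

zeroed = intervene(x_bio, W_ENC, B_ENC, W_DEC, feature=0, mode="zero")
negated = intervene(x_bio, W_ENC, B_ENC, W_DEC, feature=0, mode="negative")
untouched = intervene(x_other, W_ENC, B_ENC, W_DEC, feature=0, mode="negative")

print(zeroed)     # [0.0, 1.0]
print(negated)    # [-6.0, 1.0]
print(untouched)  # [-1.0, 1.0]
```

In this toy setting, zero ablation only removes the feature's contribution, whereas negative scaling pushes the activation vector in the opposite direction along the feature's decoder vector. Inputs on which the feature does not fire are left untouched, which is one way an intervention can keep side effects on unrelated text small.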