Abstract: We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from language models. We use the biology subset of the Weapons of Mass Destruction Proxy dataset and test on the gemma-2b-it and gemma-2-2b-it language models. We demonstrate that individual interpretable biology-related SAE features can be used to unlearn biology-related knowledge with minimal side-effects. Our results suggest that negative scaling of feature activations is necessary and that zero ablating features is ineffective. We find that intervening using multiple SAE features simultaneously can unlearn multiple different topics, but with similar or larger unwanted side-effects than the existing Representation Misdirection for Unlearning technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to the existing fine-tuning based techniques.

What problem does this paper attempt to address?

This paper explores whether sparse autoencoders (SAEs) can be used to remove specific knowledge from language models. The authors use the biology subset of the Weapons of Mass Destruction Proxy dataset and test on two language models, gemma-2b-it and gemma-2-2b-it. They aim to find an interpretable way to make a model "forget" information related to biological weapons, so that models do not carry dangerous capabilities or inaccurate information into deployment.

#### Main research objectives

1. **Verify the applicability of SAEs**: Determine whether sparse autoencoders can serve as an interpretable method for unlearning knowledge.
2. **Evaluate the intervention effect**: Adjust SAE feature activation values and measure the impact on model performance, in particular how well harmful knowledge is removed and what side effects result.
3. **Compare with existing methods**: Compare SAE-based unlearning with the existing Representation Misdirection for Unlearning (RMU) technique and analyze their respective advantages and disadvantages.

#### Research background and motivation

Language models may learn inaccurate information, produce toxic outputs, or acquire potentially dangerous capabilities (such as advanced biological weapons knowledge). Before such models are deployed, accurate and robust methods are needed to remove these undesirable capabilities or information. How to remove specific knowledge or capabilities from a language model precisely and reliably remains unclear.

#### Method overview

The authors use sparse autoencoders to learn a sparse reconstruction of language model activations and attempt to unlearn knowledge by intervening on the resulting feature activations. Specifically, they select biology-related SAE features and intervene by setting their activations to negative values. They also evaluate different intervention strategies, including feature scaling and intervening on multiple features simultaneously.

#### Key findings

1. **Effectiveness of a single feature**: Individual biology-related SAE features can remove the related knowledge with minimal side effects, but zero-ablating a feature is ineffective; negative scaling of the activation is required.
2. **Challenges of multi-feature intervention**: Intervening on multiple features simultaneously can unlearn multiple different topics, but at the cost of greater side effects.
3. **Comparison with existing methods**: Multi-feature SAE interventions produce similar or larger unwanted side effects than RMU, so SAE quality or intervention techniques would need to improve for SAE-based unlearning to be comparable to fine-tuning based techniques.

In conclusion, this research offers a direction for developing more transparent and verifiable methods for large-scale knowledge removal, but substantial further work is needed.
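The difference between zero ablation and negative scaling can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes a standard ReLU SAE (encoder `W_enc`, `b_enc`; decoder `W_dec`, `b_dec`), and the function names `sae_encode` and `intervene`, the `multiplier` parameter, and the identity-matrix toy weights are all hypothetical. The paper's exact intervention (for example, which value an active feature is set to) may differ.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Feature activations of a ReLU SAE: f = ReLU((x - b_dec) @ W_enc + b_enc)."""
    return np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)

def intervene(x, W_enc, b_enc, W_dec, b_dec, feature_ids, multiplier):
    """Rescale the selected SAE features and write the change back into x.

    multiplier = 0.0 corresponds to zero ablation; multiplier < 0 to
    negative scaling. Only the change in the selected features is added
    to x, so the SAE reconstruction error is left untouched. Features
    that are inactive (activation 0) are unaffected.
    """
    f = sae_encode(x, W_enc, b_enc, b_dec)
    delta = np.zeros_like(f)
    # New activation is multiplier * f, so the change is (multiplier - 1) * f.
    delta[..., feature_ids] = (multiplier - 1.0) * f[..., feature_ids]
    return x + delta @ W_dec

# Toy example (assumption): identity encoder/decoder, so feature i reads and
# writes dimension i of the activation vector directly.
eye, zeros = np.eye(4), np.zeros(4)
x = np.array([[1.0, -2.0, 3.0, 0.5]])

zero_ablated = intervene(x, eye, zeros, eye, zeros, [2], multiplier=0.0)
neg_scaled = intervene(x, eye, zeros, eye, zeros, [2], multiplier=-2.0)
print(zero_ablated)  # feature 2's contribution removed
print(neg_scaled)    # feature 2's contribution pushed in the opposite direction
```

In a real setting the SAE weights would come from a trained dictionary on a chosen layer of the model, and the modified activation would be passed on to the remaining layers. Passing several indices in `feature_ids` corresponds to the multi-feature intervention described above.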

Applying sparse autoencoders to unlearn knowledge in language models
