Steering Language Model Refusal with Sparse Autoencoders

Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde
2024-11-18
Abstract: Responsible practices for deploying language models include guiding models to recognize and refuse to answer prompts that are considered unsafe, while complying with safe prompts. Achieving such behavior typically requires updating model weights, which is costly and inflexible. We explore opportunities for steering model activations at inference time, which does not require updating weights. Using sparse autoencoders, we identify and steer features in Phi-3 Mini that mediate refusal behavior. We find that feature steering can improve Phi-3 Mini's robustness to jailbreak attempts across various harms, including challenging multi-turn attacks. However, we discover that feature steering can adversely affect overall performance on benchmarks. These results suggest that identifying steerable mechanisms for refusal via sparse autoencoders is a promising approach for enhancing language model safety, but that more research is needed to mitigate feature steering's adverse effects on performance.
Machine Learning
### What problem does this paper attempt to solve?

This paper addresses how to guide deployed language models (LMs) to recognize and refuse prompts considered unsafe while still responding correctly to safe prompts. Specifically:

1. **Improve the safety of language models**: The paper explores enhancing the safety of language models at inference time through feature steering, without updating model weights. This lets the model refuse unsafe prompts more reliably, including in challenging multi-turn attacks.
2. **Reduce the impact on model performance**: Although feature steering can improve safety, it can also degrade the model's overall performance. The paper studies this trade-off with the aim of minimizing these negative impacts.
3. **Avoid the cost and inflexibility of retraining**: Getting a model to recognize and refuse unsafe prompts has traditionally required updating model weights, which is expensive and inflexible. The paper instead intervenes directly on model activations during inference, avoiding the need for retraining.

### Specific methods of the paper

- **Use Sparse Autoencoders (SAEs)**: The paper trains SAEs to identify and steer features related to refusal behavior in the Phi-3 Mini model. An SAE encodes the model's activations into sparse feature vectors, and the model's behavior can then be guided by manually setting the activation values of selected features.
- **Feature identification and steering**: Key features are identified by analyzing the model's responses to specific unsafe prompts, and the model's refusal behavior is strengthened or weakened by adjusting the activation values of these features.
- **Evaluate safety**: The effect of feature steering on the refusal rate in single-turn and multi-turn dialogues is evaluated on multiple benchmarks (such as WildGuard, XSTest, and Crescendo).
- **Evaluate overall performance**: The effect of feature steering on the model's overall performance is evaluated on benchmarks such as MMLU, TruthfulQA, and GSM8K.

### Main findings

1. **Simple feature identification**: Multiple features related to refusal behavior can be found using a single hand-crafted prompt.
2. **Feature steering improves safety**: Steering the refusal features of Phi-3 Mini can significantly increase the refusal rate on unsafe prompts, in both single-turn and multi-turn dialogues.
3. **Negative impact of feature steering on overall performance**: Feature steering leads to over-refusal of safe prompts and to lower performance on factual recall and reasoning.

### Conclusion

The paper shows that feature steering through sparse autoencoders is a promising way to enhance the safety of language models without retraining. However, the method also degrades the model's overall performance, so further research is needed to reduce these negative effects.
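
To make the steering mechanism described under the methods above more concrete, here is a minimal sketch of clamping one SAE feature and decoding the result back into activation space. It is not the paper's implementation. The `ToySAE` class, its random weights, the ReLU encoder, and the helper names `top_features` and `steer` are all hypothetical. The choice to feed the full SAE reconstruction back into the model is also an assumption made for illustration. In practice the SAE would be trained on Phi-3 Mini activations and the steered vector would be written back into the forward pass.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToySAE:
    """Hypothetical toy SAE: f = ReLU(W_enc x + b_enc), x_hat = W_dec f + b_dec.

    Weights are random here; a real SAE would be trained to reconstruct
    model activations under a sparsity penalty.
    """

    def __init__(self, d_model, d_sae, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        self.w_dec = [[rng.gauss(0, 1) for _ in range(d_sae)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map an activation vector to non-negative feature activations."""
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        """Map feature activations back to activation space."""
        return [r + b for r, b in zip(matvec(self.w_dec, f), self.b_dec)]


def top_features(sae, x, k=3):
    """Indices of the k most active features for activation x (candidate features)."""
    f = sae.encode(x)
    return sorted(range(len(f)), key=lambda i: f[i], reverse=True)[:k]


def steer(sae, x, feature_idx, value):
    """Clamp one feature to `value` and return the decoded (steered) activation."""
    f = sae.encode(x)
    f[feature_idx] = value  # manually set the chosen feature's activation
    return sae.decode(f)


if __name__ == "__main__":
    sae = ToySAE(d_model=4, d_sae=8, seed=0)
    activation = [0.5, -1.0, 0.25, 2.0]  # stand-in for one token's activation
    candidate = top_features(sae, activation, k=1)[0]
    print("candidate feature:", candidate)
    print("steered activation:", steer(sae, activation, candidate, 12.0))
```

In this sketch, clamping a feature shifts the decoded activation along that feature's decoder direction by the difference between the clamp value and the feature's original activation. Larger clamp values therefore push the model harder toward the steered behavior, which is consistent with the safety versus performance trade-off reported in the findings.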