Abstract: Responsible practices for deploying language models include guiding models to recognize and refuse to answer prompts that are considered unsafe, while complying with safe prompts. Achieving such behavior typically requires updating model weights, which is costly and inflexible. We explore opportunities to steer model activations at inference time, which does not require updating weights. Using sparse autoencoders, we identify and steer features in Phi-3 Mini that mediate refusal behavior. We find that feature steering can improve Phi-3 Mini's robustness to jailbreak attempts across various harms, including challenging multi-turn attacks. However, we discover that feature steering can adversely affect overall performance on benchmarks. These results suggest that identifying steerable mechanisms for refusal via sparse autoencoders is a promising approach for enhancing language model safety, but that more research is needed to mitigate feature steering's adverse effects on performance.

What problem does this paper attempt to address?

This paper addresses how to guide a deployed language model (LM) to recognize and refuse prompts considered unsafe while still responding correctly to safe prompts. Specifically:

1. **Improve the safety of language models**: The paper explores enhancing the safety of language models at inference time through feature steering, without updating model weights. This lets the model refuse unsafe prompts more reliably, including in challenging multi-turn attacks.
2. **Reduce the impact on model performance**: Although feature steering can improve safety, it can also degrade the model's overall performance. The paper studies this trade-off and identifies mitigating these adverse effects as a direction for further research.
3. **Avoid the cost and inflexibility of retraining**: Getting a model to recognize and refuse unsafe prompts typically requires updating model weights, which is costly and inflexible. The paper instead intervenes directly on model activations at inference time, which avoids retraining.

### Specific methods of the paper

- **Use Sparse Autoencoders (SAEs)**: The paper trains SAEs to identify and steer features related to refusal behavior in the Phi-3 Mini model. An SAE encodes the model's activations into sparse feature vectors; the model's behavior can then be steered by manually setting the activation values of selected features.
- **Feature identification and steering**: Key features are identified by analyzing the model's responses to specific unsafe prompts, and refusal behavior is strengthened or weakened by adjusting the activation values of those features.
- **Evaluate safety**: The effect of feature steering on refusal rates in single-turn and multi-turn dialogues is evaluated on several benchmarks (such as Wild Guard, XSTest, and Crescendo).
- **Evaluate overall performance**: The effect of feature steering on the model's overall performance is evaluated on benchmarks such as MMLU, TruthfulQA, and GSM8K.

### Main findings

1. **Simple feature identification**: Multiple features related to refusal behavior can be found with a single hand-crafted prompt.
2. **Feature steering improves safety**: Steering the refusal features of Phi-3 Mini can substantially increase the refusal rate on unsafe prompts, in both single-turn and multi-turn dialogues.
3. **Negative impact of feature steering on overall performance**: Feature steering leads to over-refusal of safe prompts and to lower performance on factual recall and reasoning.

### Conclusion

The paper shows that feature steering through sparse autoencoders is a promising way to enhance the safety of language models without retraining. However, the method also degrades overall model performance, so further research is needed to reduce these adverse effects.
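
The SAE steering step described under the specific methods (encode an activation into sparse features, then manually set one feature's value) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the helper names (`encode`, `decode`, `steer`), the toy weights, and the choice to shift the activation along the feature's decoder direction so that the SAE's reconstruction error is left untouched. The summary does not say whether the paper replaces the activation with the full reconstruction or keeps the error term, and in a real model the edit would be applied to the layer's activations during the forward pass.

```python
# Toy sparse autoencoder (SAE) steering sketch in pure Python.
# All weights and names are hypothetical, not the paper's implementation.

def encode(x, w_enc, b_enc):
    """Map an activation vector to non-negative feature activations (ReLU)."""
    pre = [sum(w * x_j for w, x_j in zip(row, x)) + b
           for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]

def decode(f, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of decoder directions."""
    out = list(b_dec)
    for f_i, direction in zip(f, w_dec):
        for j, d in enumerate(direction):
            out[j] += f_i * d
    return out

def steer(x, w_enc, b_enc, w_dec, feature_idx, clamp_value):
    """Clamp one feature to clamp_value and return the edited activation.

    Assumption: the edit is x + (clamp_value - f_i) * decoder_direction_i,
    which changes only the chosen feature's contribution.
    """
    f = encode(x, w_enc, b_enc)
    delta = clamp_value - f[feature_idx]
    return [x_j + delta * d for x_j, d in zip(x, w_dec[feature_idx])]

# Toy setup: 3-dimensional activations, 2 features (one row per feature).
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B_DEC = [0.0, 0.0, 0.0]

X = [0.5, 2.0, 0.25]                           # stand-in for one token's activation
FEATURES = encode(X, W_ENC, B_ENC)             # feature 0 plays the "refusal" role
STEERED = steer(X, W_ENC, B_ENC, W_DEC, 0, 4.0)  # clamp that feature to 4.0
```

In this toy setup, re-encoding `STEERED` yields the clamped value for feature 0 while the other feature and the part of the activation the SAE does not reconstruct are unchanged. Larger clamp values steer harder, which is where the trade-off noted in the findings (over-refusal and lower benchmark scores) would be expected to appear.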

Steering Language Model Refusal with Sparse Autoencoders
