Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
2024-10-29
Abstract: Ensuring the safe alignment of large language models (LLMs) with human values is critical as they become integral to applications like translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, making models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework enhancing LLM safety across different scenarios: Base models, Supervised fine-tuned models (SFT), and Edited models. Safety Arithmetic involves Harm Direction Removal to avoid harmful content and Safety Alignment to promote safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.
Computation and Language
What problem does this paper attempt to address?
The paper asks how to perform safety alignment on large language models (LLMs) efficiently across different usage scenarios while preserving the models' general capabilities. It focuses on three scenarios:

1. **Base models**: These models are usually used directly after large-scale pre-training, but may generate harmful content because of biases inherent in the training data.
2. **Supervised fine-tuned (SFT) models**: These models are fine-tuned on specific tasks, and the fine-tuning process can amplify certain biases or harmful behaviors.
3. **Edited models**: Interventions or modifications made to these models after knowledge updates can have unexpected harmful consequences.

### Main contributions of the paper

To address these challenges, the paper proposes **SAFETY ARITHMETIC**, a training-free safety alignment framework. It consists of two main stages:

1. **Harm Direction Removal (HDR)**: Reduce the risk of generating harmful content by identifying and removing harmful directions in the model parameters. The specific steps are:
   - Fine-tune a model \( \theta_b \) on a small dataset of harmful question-answer pairs \( D_H \) to obtain the harmful model \( \theta_H \).
   - Calculate the harmful vector \( \tau_H = \theta_H - \theta_b \).
   - Select the \( k \) most important parameters in the harmful vector to form a new harmful vector \( \tau'_H \).
   - Remove \( \tau'_H \) from the target model \( \theta_t \) to obtain the intermediate model \( \hat{\theta}_t \).
2. **Safety Alignment (Safe-Align)**: Guide the model toward safe responses by adjusting its latent space. The specific steps are:
   - Prepare a set of in-context examples \( D_{icl} \) that includes harmful and safe prompts.
   - Calculate the In-Context Safety Vector (ICV), which moves the latent state of the model closer to the representation of safe prompts.
   - Add the ICV to the latent states of the intermediate model \( \hat{\theta}_t \) to obtain the final safe model \( \theta_{sf} \).

### Experimental results

The paper evaluates the SAFETY ARITHMETIC framework on multiple benchmarks and datasets. The results show that the framework significantly reduces the proportion of harmful content generated by the model (Attack Success Rate, ASR) while maintaining the model's performance and general capabilities:

- **Base models**: SAFETY ARITHMETIC significantly reduces the ASR on multiple datasets. For example, on the AdvBench dataset, the ASR of Llama2 and Mistral is reduced from 19.81% and 60.96% to 6.15% and 24.23% respectively.
- **Supervised fine-tuned models**: SAFETY ARITHMETIC also significantly reduces the ASR of models such as WizardMath, LlamaMath, and EvolCodeAlpaca. For example, on the AdvBench dataset, their ASR is reduced from 79.62%, 56.73% and 92.19% to 37.69%, 15.58% and 51.54% respectively.
- **Edited models**: SAFETY ARITHMETIC effectively reduces the ASR of both unintentionally and intentionally edited models. For example, on the HEx-PHI dataset, the ASR is reduced from 43.64% to 6.97%.
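The Harm Direction Removal steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: parameters are represented as a dictionary of arrays, "most important" is read as largest absolute value, "removal" is read as subtraction, and the scaling factor `lam` is a hypothetical knob that the summary does not mention. The function names are illustrative only.

```python
import numpy as np

def harm_vector(theta_h, theta_b):
    """tau_H = theta_H - theta_b, computed per parameter tensor."""
    return {name: theta_h[name] - theta_b[name] for name in theta_b}

def keep_top_k(tau, k):
    """Form tau'_H by zeroing all but the k largest-magnitude entries.

    Assumption: importance is measured by absolute value. Ties at the
    threshold are all kept, so slightly more than k entries may survive.
    """
    flat = np.concatenate([np.abs(v).ravel() for v in tau.values()])
    if k <= 0:
        return {name: np.zeros_like(v) for name, v in tau.items()}
    if k >= flat.size:
        return {name: v.copy() for name, v in tau.items()}
    # The k-th largest magnitude sits at index (size - k) in sorted order.
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return {name: np.where(np.abs(v) >= threshold, v, 0.0)
            for name, v in tau.items()}

def remove_harm_direction(theta_t, tau_pruned, lam=1.0):
    """Intermediate model: theta_t minus the scaled, pruned harm vector.

    Assumption: subtraction with a hypothetical scale `lam`.
    """
    return {name: theta_t[name] - lam * tau_pruned[name] for name in theta_t}

# Toy example with a single four-parameter "layer".
theta_b = {"w": np.array([1.0, 2.0, 3.0, 4.0])}
theta_h = {"w": np.array([1.5, 2.0, 2.0, 4.1])}
theta_t = {"w": np.array([2.0, 2.0, 2.0, 2.0])}

tau = harm_vector(theta_h, theta_b)
tau_pruned = keep_top_k(tau, k=2)
theta_hat = remove_harm_direction(theta_t, tau_pruned, lam=1.0)
```

In this toy case only the two largest shifts (0.5 and -1.0) survive pruning, so only the first and third parameters of the target model change. Pruning keeps the edit sparse, which is one plausible reason the framework can leave general capabilities largely intact.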
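The Safety Alignment stage described above can be sketched in the same spirit. The summary only says that the ICV moves latent states toward the representation of safe prompts, so this sketch makes its own assumptions: the vector is the mean difference between hidden states of paired safe and harmful prompts at a single layer, and the steering strength `alpha` is a hypothetical parameter. The paper's actual construction may differ.

```python
import numpy as np

def in_context_safety_vector(h_safe, h_harmful):
    """Mean difference between paired safe and harmful hidden states.

    h_safe, h_harmful: arrays of shape (n_examples, hidden_dim).
    Assumption: a simple mean difference stands in for the ICV.
    """
    return (h_safe - h_harmful).mean(axis=0)

def steer(hidden, icv, alpha=0.1):
    """Shift every hidden state along the ICV (hypothetical strength alpha)."""
    return hidden + alpha * icv

# Toy hidden states for two in-context example pairs, hidden_dim = 2.
h_safe = np.array([[1.0, 0.0], [3.0, 2.0]])
h_harmful = np.array([[0.0, 0.0], [1.0, 1.0]])
icv = in_context_safety_vector(h_safe, h_harmful)

# Hidden states of the intermediate model for two token positions.
hidden = np.array([[0.0, 0.0], [1.0, 1.0]])
steered = steer(hidden, icv, alpha=2.0)
```

Because the shift is applied to activations at inference time rather than to the weights, it needs no further training, which matches the framework's training-free description.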