Abstract: Ensuring the safe alignment of large language models (LLMs) with human values is critical as they become integral to applications like translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, making models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework enhancing LLM safety across different scenarios: Base models, Supervised fine-tuned models (SFT), and Edited models. Safety Arithmetic involves Harm Direction Removal to avoid harmful content and Safety Alignment to promote safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to efficiently perform safety alignment on large language models (LLMs) in different usage scenarios while maintaining the general capabilities of the models. Specifically, the paper focuses on the following three main scenarios:

1. **Base models**: These models are usually used directly after large-scale pre-training, but may generate harmful content due to the inherent biases in the training data.
2. **Supervised fine-tuned models (SFT)**: These models are fine-tuned on specific tasks, but certain biases or harmful behaviors may be amplified during the fine-tuning process.
3. **Edited models**: These models may have unexpected harmful consequences due to interventions or modifications after knowledge updates.

### Main contributions of the paper

To address the above challenges, the paper proposes a framework named **SAFETY ARITHMETIC**, which is a training-free safety alignment technique. This framework consists of two main stages:

1. **Harm Direction Removal (HDR)**: Reduce the risk of generating harmful content by identifying and removing harmful directions in the model parameters. The specific steps include:
   - Fine-tune the model using a small dataset of harmful question-answer pairs \( D_H \) to obtain the harmful model \( \theta_H \).
   - Calculate the harmful vector \( \tau_H = \theta_H - \theta_b \).
   - Select the \( k \) most important parameters in the harmful vector to form a new harmful vector \( \tau'_H \).
   - Apply \( \tau'_H \) to the target model \( \theta_t \) to obtain the intermediate model \( \hat{\theta}_t \).
2. **Safety Alignment (Safe-Align)**: Guide the model to generate safe responses by adjusting the latent space of the model. The specific steps include:
   - Prepare a series of context examples \( D_{icl} \), including harmful and safe prompts.
   - Calculate the In-Context Safety Vector (ICV), which makes the latent state of the model closer to the representation of safe prompts.
   - Add the ICV to the latent state of the intermediate model \( \hat{\theta}_t \) to obtain the final safe model \( \theta_{sf} \).

### Experimental results

The paper verifies the effectiveness of the SAFETY ARITHMETIC framework through multiple benchmark tests and datasets. The experimental results show that this framework can significantly reduce the proportion of harmful content generated by the model (Attack Success Rate, ASR), while maintaining the performance and general capabilities of the model. This is specifically manifested in the following aspects:

- **Base models**: On multiple datasets, SAFETY ARITHMETIC significantly reduces the ASR. For example, on the AdvBench dataset, the ASR of Llama2 and Mistral is reduced from 19.81% and 60.96% to 6.15% and 24.23% respectively.
- **Supervised fine-tuned models**: On models such as WizardMath, LlamaMath, and EvolCodeAlpaca, SAFETY ARITHMETIC also significantly reduces the ASR. For example, on the AdvBench dataset, it is reduced from 79.62%, 56.73% and 92.19% to 37.69%, 15.58% and 51.54% respectively.
- **Edited models**: For both unintentionally and intentionally edited models, SAFETY ARITHMETIC can effectively reduce the ASR. For example, on the HEx-PHI dataset, it is reduced from 43.64% to 6.97%.
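The two stages summarized under the main contributions can be illustrated with a toy NumPy sketch. This is not the paper's implementation: parameters and latent states are stood in for by small arrays, and several details are assumptions made only for illustration. In particular, "most important" is taken to mean largest magnitude, "apply" is taken to mean a scaled subtraction from \( \theta_t \), and the ICV is taken to be the mean difference between safe and harmful latent states. The function names and the scaling factors `lam` and `alpha` are hypothetical.

```python
import numpy as np


def harm_vector(theta_h, theta_b):
    """tau_H = theta_H - theta_b, as in the summary above."""
    return theta_h - theta_b


def top_k_by_magnitude(tau, k):
    """Keep the k largest-magnitude entries of tau and zero the rest.

    Assumption: magnitude is used as the importance criterion.
    """
    flat = tau.ravel()
    if k <= 0:
        return np.zeros_like(tau)
    if k >= flat.size:
        return tau.copy()
    keep = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k entries
    pruned = np.zeros_like(flat)
    pruned[keep] = flat[keep]
    return pruned.reshape(tau.shape)


def remove_harm_direction(theta_t, tau_pruned, lam=1.0):
    """Intermediate model: theta_t minus the scaled, pruned harm vector.

    Assumption: 'apply' means subtraction, with a hypothetical scale lam.
    """
    return theta_t - lam * tau_pruned


def in_context_safety_vector(h_safe, h_harmful):
    """Mean difference between latent states of paired safe and harmful prompts.

    Assumption: a simplified stand-in for the ICV; each row is one prompt.
    """
    return (h_safe - h_harmful).mean(axis=0)


def steer(hidden, icv, alpha=1.0):
    """Shift a latent state toward the safe representation."""
    return hidden + alpha * icv


# Toy usage with made-up numbers.
theta_b = np.array([0.0, 1.0, 2.0, 3.0])
theta_h = np.array([0.5, 1.0, 1.0, 3.1])
theta_t = np.ones(4)

tau_pruned = top_k_by_magnitude(harm_vector(theta_h, theta_b), k=2)
theta_hat = remove_harm_direction(theta_t, tau_pruned)

h_safe = np.array([[1.0, 0.0], [3.0, 2.0]])
h_harmful = np.array([[0.0, 0.0], [1.0, 1.0]])
icv = in_context_safety_vector(h_safe, h_harmful)
steered = steer(np.zeros(2), icv, alpha=2.0)
```

In a real model the first stage would operate on every weight tensor of the checkpoint and the second on per-layer hidden states during the forward pass; the sketch only shows the arithmetic on single arrays.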

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Superficial Safety Alignment Hypothesis

A safety realignment framework via subspace-oriented model fusion for large language models

Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness

Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

Finding Safety Neurons in Large Language Models

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs

Locking Down the Finetuned LLMs Safety

SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering

Model Merging and Safety Alignment: One Bad Model Spoils the Bunch