Abstract: As large language models (LLMs) are increasingly integrated into various applications, ensuring that they generate safe and aligned responses is a pressing need. Previous research on alignment has largely focused on general instruction-following and has often overlooked the unique properties and challenges of safety alignment, such as the brittleness of safety mechanisms. To bridge this gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction, interpreted as a specialized binary classification task, and incorporate a refusal mechanism with multiple reserved fallback options. Furthermore, through SSAH, we hypothesize that safety guardrails in LLMs can be established by just a small number of essential components. To verify this, we conduct an ablation study and identify four types of attribute-critical components in safety-aligned LLMs: Exclusive Safety Unit (ESU), Exclusive Utility Unit (EUU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components (7.5%) during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Additionally, we show that leveraging redundant units (20%) in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. All things considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. We believe this work contributes to the foundation of efficient and scalable safety alignment for future LLMs.
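The idea of freezing safety-critical components during fine-tuning can be illustrated with a toy update rule. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the abstract does not say how units are selected or frozen, so the function `masked_sgd_step`, the row-per-neuron layout, and the choice of which neuron is frozen are all hypothetical.

```python
from typing import List, Set

Matrix = List[List[float]]


def masked_sgd_step(
    weights: Matrix, grads: Matrix, frozen_rows: Set[int], lr: float = 0.1
) -> Matrix:
    """Apply one plain SGD step, leaving the rows in frozen_rows unchanged.

    Each row stands for one neuron's incoming weights (an assumption made
    for this sketch; the abstract only says the functional unit is the neuron).
    """
    updated: Matrix = []
    for i, (w_row, g_row) in enumerate(zip(weights, grads)):
        if i in frozen_rows:
            # Frozen (assumed safety-critical) neuron: copy weights as-is.
            updated.append(list(w_row))
        else:
            # Trainable neuron: ordinary gradient descent update.
            updated.append([w - lr * g for w, g in zip(w_row, g_row)])
    return updated


# Toy layer with 4 neurons; neuron 1 is assumed to be safety-critical.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
new_weights = masked_sgd_step(weights, grads, frozen_rows={1}, lr=0.5)
```

In this toy setting the frozen neuron keeps its weights while the other three move along the gradient. How the frozen set would be identified in a real model (the ESU, EUU, CU, and RU categories) is not specified in the abstract and is left out here.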
