Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson

2024-06-10

Abstract: The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and provide evidence that current aligned LLMs are subject to this issue. We also show how these findings help explain multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. Importantly, we discuss how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. For instance, we show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits. Finally, we design a regularized finetuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.

Cryptography and Security, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the fragility of safety alignment in current large language models (LLMs): relatively simple attacks, or even benign fine-tuning, can "jailbreak" aligned models so that they generate harmful content. The authors argue that many of these vulnerabilities share a common underlying cause: safety alignment is often shallow, meaning that it adapts the model's generative distribution primarily over only the first few output tokens. As a result, once the initial output deviates from the safe path, the model can readily go on to generate harmful content. The paper presents case studies to explain why shallow safety alignment can exist and provides evidence that current aligned LLMs are subject to it. It also shows how this finding helps explain several recently discovered vulnerabilities, including adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. The authors then turn to mitigation: they show that deepening the safety alignment beyond the first few tokens can often meaningfully improve robustness against some common exploits, and they design a regularized fine-tuning objective that constrains updates on the initial tokens, making the alignment more persistent against fine-tuning attacks. In summary, the paper examines the limitations of current safety alignment methods and argues that future alignment should extend deeper than the first few output tokens.
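The two ideas in the summary above (alignment concentrated on the first few tokens, and a penalty on changes to those tokens during fine-tuning) can be illustrated with a small sketch. This is not the paper's objective or evaluation code. It is a toy illustration that assumes per-position next-token distributions are already available as plain lists, and the names per_token_kl, early_token_penalty, and betas are hypothetical. The probabilities are made up for illustration and are not results from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def per_token_kl(dists_a, dists_b):
    """KL divergence at each output position between two models' distributions."""
    return [kl_divergence(p, q) for p, q in zip(dists_a, dists_b)]

def early_token_penalty(finetuned_dists, aligned_dists, betas):
    """Position-weighted KL penalty (hypothetical stand-in for a regularizer).

    Larger betas[t] at early positions penalize drift away from the aligned
    model's distribution more strongly on the first few tokens.
    """
    kls = per_token_kl(finetuned_dists, aligned_dists)
    return sum(b * k for b, k in zip(betas, kls))

# Made-up distributions over a 3-token vocabulary at 4 output positions.
# The "aligned" model differs from the "base" model mostly at early positions.
base = [[0.34, 0.33, 0.33]] * 4
aligned = [
    [0.98, 0.01, 0.01],  # position 0: strongly shifted
    [0.90, 0.05, 0.05],  # position 1: still shifted
    [0.40, 0.30, 0.30],  # position 2: nearly the base distribution
    [0.34, 0.33, 0.33],  # position 3: identical to the base distribution
]

kls = per_token_kl(aligned, base)
betas = [2.0, 2.0, 0.1, 0.1]  # assumed weights: heavier on the first tokens
penalty = early_token_penalty(base, aligned, betas)

for t, k in enumerate(kls):
    print(f"position {t}: KL(aligned || base) = {k:.4f}")
print(f"penalty for drifting back to base: {penalty:.4f}")
```

In this toy setup the per-position divergence shrinks toward zero after the first two positions, which is the pattern the term "shallow" refers to. The weighted penalty is zero when the fine-tuned distributions match the aligned ones and grows when the early positions drift.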

Superficial Safety Alignment Hypothesis

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

A safety realignment framework via subspace-oriented model fusion for large language models

Safety Layers in Aligned Large Language Models: The Key to LLM Security

Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

Locking Down the Finetuned LLMs Safety

Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

Finding Safety Neurons in Large Language Models

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack

Fake Alignment: Are LLMs Really Aligned Well?

Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks

Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes