Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang
2024-07-03
Abstract: LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even with the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.
Cryptography and Security, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the security of large language models (LLMs), specifically the defense against "jailbreak attacks". Jailbreak attacks use carefully crafted prompts to induce harmful responses from the model. The authors observe that although different jailbreak attacks produce very different queries, they mostly lead to similar harmful responses rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). The paper therefore proposes "Safe Unlearning", which prevents harmful responses by removing the harmful knowledge from the model rather than only learning to recognize harmful queries. In the experiments, training on only 20 raw harmful questions without any jailbreak prompts reduces the attack success rate (ASR) of Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped in jailbreak prompts from 82.6% to 7.7%. This outperforms Llama2-7B-Chat, which is fine-tuned on approximately 0.1M safety alignment samples yet still has an ASR of 21.9% even with an additional safety system prompt. The study also examines the generalization ability of Safe Unlearning, which defends against various combinations of jailbreak prompts and OOD harmful questions. The main contributions of the paper are: 1. Proposing unlearning as an effective principle for defending against jailbreak attacks and implementing it as Safe Unlearning, which significantly reduces ASR while maintaining the overall performance of the model. 2. Demonstrating that the unlearning method generalizes to jailbreak attacks even when no jailbreak prompts are used during training. 3. Providing insights into jailbreak attacks and unlearning-based defense through empirical analysis. Methodologically, Safe Unlearning combines three loss functions, used respectively for unlearning harmful responses, learning refusal responses, and maintaining the overall performance of the model.
The unlearning loss is adaptive: it weakens the influence of the unlearning objective when the probability of harmful responses is already low, which improves training stability. The experimental results show that Safe Unlearning achieves the lowest ASR on both in-distribution (ID) and OOD harmful questions, even though no jailbreak prompts are used during training. The ASR can approach zero, indicating that unlearning generalizes strongly as a defense.
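The summary above does not give the loss formulas, so the following is only a minimal sketch of how a three-part objective with an adaptive unlearning term could look. The sigmoid-bounded form of the unlearning term, the reference-model log-probability, the function names, and the weights `beta`, `alpha`, and `gamma` are all assumptions made for illustration, not the paper's confirmed definitions. The inputs are sequence-level log-probabilities that a real training loop would obtain from the model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def unlearn_loss(logp_harmful: float, logp_harmful_ref: float, beta: float = 1.0) -> float:
    """Assumed adaptive unlearning term: -log sigmoid(-beta * (logp - logp_ref)).

    It pushes the harmful response's log-probability below that of a frozen
    reference model, but saturates once the response is already unlikely.
    """
    return -log_sigmoid(-beta * (logp_harmful - logp_harmful_ref))


def unlearn_grad_weight(logp_harmful: float, logp_harmful_ref: float, beta: float = 1.0) -> float:
    """Derivative of unlearn_loss with respect to logp_harmful.

    Equals beta * sigmoid(beta * (logp - logp_ref)), which shrinks toward 0
    as the harmful response becomes much less likely than under the reference.
    """
    z = beta * (logp_harmful - logp_harmful_ref)
    return beta * math.exp(log_sigmoid(z))


def combined_loss(
    logp_harmful: float,
    logp_harmful_ref: float,
    logp_refusal: float,
    logp_helpful: float,
    beta: float = 1.0,
    alpha: float = 1.0,   # hypothetical weight of the refusal term
    gamma: float = 1.0,   # hypothetical weight of the performance term
) -> float:
    """Sum of the three terms: unlearn harmful, learn refusal, keep helpfulness."""
    refusal_nll = -logp_refusal   # negative log-likelihood of a refusal response
    helpful_nll = -logp_helpful   # negative log-likelihood of a normal helpful response
    return (
        unlearn_loss(logp_harmful, logp_harmful_ref, beta)
        + alpha * refusal_nll
        + gamma * helpful_nll
    )


if __name__ == "__main__":
    # The gradient weight of the unlearning term falls as the harmful
    # response's log-probability drops below the reference value of -5.0.
    for logp in (-5.0, -6.0, -10.0, -50.0):
        print(logp, unlearn_grad_weight(logp, -5.0))
```

Under these assumptions, the gradient weight of the unlearning term decays toward zero as the harmful response becomes unlikely, which is one way to realize the stabilizing behavior described above. A plain negative log-likelihood ascent term would instead keep growing without bound.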