Abstract: LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even with the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.
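To make the reported numbers concrete, the following is a minimal sketch of how an attack success rate and the relative reduction between two ASR values could be computed. The abstract does not say how responses are judged harmful, so the boolean judgments and both function names are hypothetical; only the 82.6% and 7.7% figures come from the abstract.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Percentage of attack attempts whose response was judged harmful.

    `judged_harmful` is a hypothetical list of per-prompt verdicts; how the
    verdicts are produced (human or automatic judge) is outside this sketch.
    """
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judged_harmful) / len(judged_harmful)


def relative_reduction(asr_before: float, asr_after: float) -> float:
    """Relative drop in ASR, in percent of the starting value."""
    return 100.0 * (asr_before - asr_after) / asr_before


# Toy verdicts: 2 of 4 attacks succeed.
print(attack_success_rate([True, False, False, True]))

# Figures from the abstract: 82.6% before, 7.7% after unlearning.
print(round(relative_reduction(82.6, 7.7), 1))
```

Under these definitions, the drop from 82.6% to 7.7% is a relative reduction of roughly 90.7%.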

What problem does this paper attempt to address?

This paper addresses the safety of large language models (LLMs), specifically defense against "jailbreak attacks", which use carefully crafted prompts to induce harmful responses from the model. The authors observe that although different jailbreak attacks produce very different queries, they mostly lead to similar harmful responses rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). The paper therefore proposes "Safe Unlearning", which prevents harmful responses by directly unlearning the harmful knowledge in the model rather than only identifying harmful queries. In experiments, training on only 20 raw harmful questions without any jailbreak prompts significantly reduces the attack success rate (ASR) on out-of-distribution (OOD) harmful questions wrapped in jailbreak prompts, outperforming Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples. The method generalizes across various combinations of jailbreak prompts and OOD harmful questions.

The main contributions of the paper include:
1. Proposing unlearning as an effective principle for defending against jailbreak attacks and implementing it as Safe Unlearning, which significantly reduces ASR while maintaining the overall performance of the model.
2. Demonstrating the generalization ability of the unlearning method against jailbreak attacks, even without jailbreak prompts during training.
3. Providing insights into jailbreak attacks and unlearning-based defense through empirical analysis.

Methodologically, Safe Unlearning combines three loss functions, used respectively for unlearning harmful responses, learning refusal responses, and maintaining the overall performance of the model. An adaptive unlearning loss controls the optimization process: the influence of the unlearning objective is weakened when the probability of harmful responses is already low, which improves training stability. The experimental results show that Safe Unlearning achieves the lowest ASR on both in-distribution (ID) and OOD harmful questions, even without jailbreak prompts during training. The ASR can approach zero, indicating that the unlearning-based defense generalizes strongly.
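The three-term objective described above can be illustrated with a small sketch. The summary gives no formulas, so everything below is an assumption rather than the paper's actual loss: the unlearning term is written as a sigmoid-bounded penalty on the log-probability ratio between the trained model and a frozen reference model, the other two terms are plain negative log-likelihoods, and all function names and weights are hypothetical. The point is only to show one way an unlearning term can fade out once the harmful response is already unlikely.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def unlearn_loss(logp_harmful: float, logp_harmful_ref: float,
                 beta: float = 1.0) -> float:
    """Assumed adaptive unlearning term: log(1 + exp(beta * log_ratio)).

    log_ratio compares the current model's log-probability of the harmful
    response with that of a frozen reference model.
    """
    log_ratio = logp_harmful - logp_harmful_ref
    return -log_sigmoid(-beta * log_ratio)


def unlearn_grad_scale(logp_harmful: float, logp_harmful_ref: float,
                       beta: float = 1.0) -> float:
    """Derivative of unlearn_loss w.r.t. logp_harmful: beta * sigmoid(beta * log_ratio).

    It shrinks toward 0 as the harmful response becomes less likely, which is
    the "weakened influence" behavior the text describes.
    """
    log_ratio = logp_harmful - logp_harmful_ref
    return beta * sigmoid(beta * log_ratio)


def safe_unlearning_loss(logp_harmful: float, logp_harmful_ref: float,
                         logp_refusal: float, logp_helpful: float,
                         w_unlearn: float = 1.0, w_refuse: float = 1.0,
                         w_keep: float = 1.0) -> float:
    """Weighted sum of the three terms (weights are hypothetical)."""
    l_unlearn = unlearn_loss(logp_harmful, logp_harmful_ref)
    l_refuse = -logp_refusal  # NLL of a safe refusal to the harmful query
    l_keep = -logp_helpful    # NLL of a helpful answer to a benign query
    return w_unlearn * l_unlearn + w_refuse * l_refuse + w_keep * l_keep


# Gradient scale before unlearning (ratio 0) vs. after the harmful
# response has become far less likely than under the reference model.
print(unlearn_grad_scale(0.0, 0.0))
print(unlearn_grad_scale(-20.0, 0.0))
```

In this sketch the gradient scale starts at 0.5 when the model matches the reference and decays toward zero as the harmful log-probability falls, so the refusal and performance-preserving terms dominate late in training. An unbounded alternative such as plain gradient ascent on the harmful response would not have this self-limiting property.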

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Don't Say No: Jailbreaking LLM by Suppressing Refusal

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

Self-Guard: Empower the LLM to Safeguard Itself

Multi-round jailbreak attack on large language models

GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Playing Language Game with LLMs Leads to Jailbreaking

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

An Adversarial Perspective on Machine Unlearning for AI Safety

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper