Abstract:Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed interpretable suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and interpretable solution applicable to various LLM platforms.

What problem does this paper attempt to address?

This paper attempts to address the security and compliance issues of large - language models (LLMs) when facing jailbreak attacks. Specifically, the paper aims to develop a new defense mechanism - **Defensive Prompt Patch (DPP)** to protect LLMs from the influence of these attacks while minimizing the damage to the model's functionality. ### 1. **Problem Background** In recent years, LLMs such as GPT - 4, LLAMA - 2, and Mistral have demonstrated strong text - understanding and - generation capabilities, but they are also vulnerable to the threat of jailbreak attacks. These attacks bypass the model's security protection mechanisms by introducing malicious queries, resulting in the model generating harmful or non - compliant outputs. Existing defense methods often sacrifice the model's functionality while improving security and cannot effectively balance the relationship between the two. ### 2. **Research Objectives** The goal of the paper is to propose a new method that can effectively resist jailbreak attacks and maintain the high functionality of LLMs. Specifically, DPP can effectively prevent various standard and adaptive jailbreak techniques by designing interpretable suffix prompts while minimizing the impact on the model's functionality. ### 3. **Main Contributions** - **Improved Defense Mechanism**: DPP can maintain the high functionality of the model while minimizing the attack success rate (ASR). - **Robustness Against Adaptive Attacks**: DPP performs well under multiple adaptive and unforeseen jailbreak strategies, significantly reducing the average attack success rate. - **Interpretability and Stability**: DPP not only improves the interpretability of the defense mechanism but also verifies its wide applicability on different LLM platforms through experiments. ### 4. **Method Overview** The core idea of DPP is to attach an optimized defensive prompt patch to each input query to ensure that the model can recognize and reject malicious queries. This method uses adversarial and functional datasets to iteratively optimize the prompt and further enhances the effectiveness of the prompt through the Hierarchical Genetic Algorithm (HGA). ### 5. **Experimental Results** The experimental results show that DPP performs well on both the LLAMA - 2 - 7B - Chat and Mistral - 7B - Instruct - v0.2 models, significantly reducing the attack success rate (ASR), and has good generalization ability in both non - adaptive and adaptive attack scenarios. In addition, while maintaining a low ASR, DPP has very little negative impact on the model's functionality. In conclusion, by introducing DPP, this paper addresses the deficiencies of existing defense methods in balancing security and functionality, providing a new solution for the safe application of LLMs.

Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Defending LLMs against Jailbreaking Attacks via Backtranslation

FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks

Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks

RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Don't Say No: Jailbreaking LLM by Suppressing Refusal

Fight Back Against Jailbreaking via Prompt Adversarial Tuning

DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers

ObscurePrompt: Jailbreaking Large Language Models via Obscure Input

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

DROJ: A Prompt-Driven Attack against Large Language Models

Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning