Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

Chen Xiong,Xiangyu Qi,Pin-Yu Chen,Tsung-Yi Ho
2024-05-30
Abstract:Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed interpretable suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and interpretable solution applicable to various LLM platforms.
Cryptography and Security
What problem does this paper attempt to address?
This paper attempts to address the security and compliance issues of large - language models (LLMs) when facing jailbreak attacks. Specifically, the paper aims to develop a new defense mechanism - **Defensive Prompt Patch (DPP)** to protect LLMs from the influence of these attacks while minimizing the damage to the model's functionality. ### 1. **Problem Background** In recent years, LLMs such as GPT - 4, LLAMA - 2, and Mistral have demonstrated strong text - understanding and - generation capabilities, but they are also vulnerable to the threat of jailbreak attacks. These attacks bypass the model's security protection mechanisms by introducing malicious queries, resulting in the model generating harmful or non - compliant outputs. Existing defense methods often sacrifice the model's functionality while improving security and cannot effectively balance the relationship between the two. ### 2. **Research Objectives** The goal of the paper is to propose a new method that can effectively resist jailbreak attacks and maintain the high functionality of LLMs. Specifically, DPP can effectively prevent various standard and adaptive jailbreak techniques by designing interpretable suffix prompts while minimizing the impact on the model's functionality. ### 3. **Main Contributions** - **Improved Defense Mechanism**: DPP can maintain the high functionality of the model while minimizing the attack success rate (ASR). - **Robustness Against Adaptive Attacks**: DPP performs well under multiple adaptive and unforeseen jailbreak strategies, significantly reducing the average attack success rate. - **Interpretability and Stability**: DPP not only improves the interpretability of the defense mechanism but also verifies its wide applicability on different LLM platforms through experiments. ### 4. **Method Overview** The core idea of DPP is to attach an optimized defensive prompt patch to each input query to ensure that the model can recognize and reject malicious queries. This method uses adversarial and functional datasets to iteratively optimize the prompt and further enhances the effectiveness of the prompt through the Hierarchical Genetic Algorithm (HGA). ### 5. **Experimental Results** The experimental results show that DPP performs well on both the LLAMA - 2 - 7B - Chat and Mistral - 7B - Instruct - v0.2 models, significantly reducing the attack success rate (ASR), and has good generalization ability in both non - adaptive and adaptive attack scenarios. In addition, while maintaining a low ASR, DPP has very little negative impact on the model's functionality. In conclusion, by introducing DPP, this paper addresses the deficiencies of existing defense methods in balancing security and functionality, providing a new solution for the safe application of LLMs.