PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails

Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash
DOI: https://doi.org/10.48550/arXiv.2402.15911
2024-02-25
Abstract: Large language models (LLMs) are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model, which is a second LLM that is designed to check and moderate the output response of the primary LLM. Our key contribution is to show a novel attack strategy, PRP, that is successful against several open-source (e.g., Llama 2) and closed-source (e.g., GPT 3.5) implementations of Guard Models. PRP leverages a two-step prefix-based attack that operates by (a) constructing a universal adversarial prefix for the Guard Model, and (b) propagating this prefix to the response. We find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the Guard Model at all. Our work suggests that further advances are required on defenses and Guard Models before they can be considered effective.
Cryptography and Security, Computation and Language
### What problem does this paper attempt to solve?

This paper addresses the problem that large language models (LLMs) can still be attacked even after guard mechanisms (Guard Models) are introduced. Specifically, the authors study how to bypass these guard mechanisms with a new attack strategy, **PRP (Propagating Universal Perturbations to Attack Large Language Model Guard-Rails)**, thereby inducing protected LLMs to generate harmful content.

#### Background and problem description

1. **Alignment problem**: LLMs are usually aligned to be helpful, honest, and harmless (HHH). However, recent research shows that these models are vulnerable to automated "jailbreak attacks", which manipulate input prompts so that the model generates harmful content.
2. **Guard mechanism**: To enhance security, some recent LLMs add an extra protection layer, the **Guard Model**, which is a separate LLM that checks and moderates the output responses of the main LLM. If the Guard Model detects harmful content, the system refuses to return the response.
3. **Limitations of existing attacks**: Existing attack methods mainly focus on manipulating input prompts to break the alignment of the underlying LLM, but once a Guard Model is introduced, these attacks are no longer effective. Evaluating the security of LLMs protected by a Guard Model has therefore become a challenging problem.

#### Core problems of the paper

- **Can current Guard Models really prevent jailbreak attacks?**
- **Can an adaptive attack strategy be designed so that LLMs protected by a Guard Model also generate harmful responses?**

#### Proposed solutions

The authors propose a new systematic attack method named **PRP**, which specifically targets LLMs protected by a Guard Model. PRP is based on two key insights:

1. **Vulnerability of Guard Models to universal attacks**: Guard Models are vulnerable to universal adversarial prefixes which, when prepended to any input, weaken their ability to detect harmful content.
2. **Injecting adversarial prefixes using in-context learning**: Attackers can use the in-context learning ability of LLMs to inject adversarial prefixes into the responses of the underlying LLM.

The PRP framework is divided into two stages:

1. **Computing the universal adversarial prefix**: Compute a universal adversarial prefix for the Guard Model, such that prepending it to any harmful response evades the Guard Model's detection.
2. **Propagating the adversarial prefix**: Use in-context learning to compute a propagation prefix, such that prepending it to any existing jailbreak prompt makes the response of the underlying LLM start with the universal adversarial prefix.

Through these two stages, PRP can successfully carry out end-to-end jailbreak attacks under multiple threat models, even when the attacker has no access to the Guard Model.

### Summary

By proposing the PRP attack framework, this paper reveals the deficiencies of current Guard Models in preventing jailbreak attacks and emphasizes the need for further improvement of defense mechanisms.
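
To illustrate how the two stages of PRP fit together, the following is a minimal Python sketch built entirely on toy stand-ins. Every name in it (`toy_guard`, `toy_base_llm`, `build_propagation_prefix`, `guarded_llm`, the `<adv-prefix>` string) is hypothetical and not taken from the paper. The guard is a keyword check with a deliberately planted blind spot, the base model simply imitates its in-context demonstrations, and the search for the universal prefix itself is not shown. It only demonstrates the data flow: the propagation prefix is prepended to the prompt, the base model's response then begins with the universal prefix, and the guard no longer flags that response.

```python
from typing import Callable

REFUSAL = "Sorry, I cannot help with that."
UNIVERSAL_PREFIX = "<adv-prefix>"  # hypothetical placeholder for the stage-1 result
FLAGGED_MARKER = "[flagged]"       # stands in for content the guard should catch


def toy_guard(response: str) -> bool:
    """Return True if the response is flagged.

    Toy assumption: the guard has a planted blind spot and never flags
    a response that starts with UNIVERSAL_PREFIX.
    """
    if response.startswith(UNIVERSAL_PREFIX):
        return False
    return FLAGGED_MARKER in response


def toy_base_llm(prompt: str) -> str:
    """Toy base model, assumed to be already jailbroken by the prompt.

    It mimics in-context learning: if every demonstration answer
    ("A: ..." line) starts with the same token, it starts its own
    answer with that token too.
    """
    demo_answers = [line[3:] for line in prompt.splitlines() if line.startswith("A: ")]
    first_tokens = {answer.split()[0] for answer in demo_answers if answer.split()}
    start = f"{next(iter(first_tokens))} " if len(first_tokens) == 1 else ""
    return f"{start}{FLAGGED_MARKER} placeholder response"


def build_propagation_prefix(universal_prefix: str, n_demos: int = 2) -> str:
    """Stage 2 (sketch): few-shot demos whose answers begin with the prefix."""
    demos = [
        f"Q: example question {i}\nA: {universal_prefix} example answer {i}"
        for i in range(n_demos)
    ]
    return "\n".join(demos) + "\n"


def guarded_llm(
    base_llm: Callable[[str], str], guard: Callable[[str], bool], prompt: str
) -> str:
    """Guard-railed system: release the response only if the guard passes it."""
    response = base_llm(prompt)
    return REFUSAL if guard(response) else response


jailbreak_prompt = "Q: placeholder jailbreak prompt"
blocked = guarded_llm(toy_base_llm, toy_guard, jailbreak_prompt)
bypassed = guarded_llm(
    toy_base_llm, toy_guard, build_propagation_prefix(UNIVERSAL_PREFIX) + jailbreak_prompt
)
print(blocked)   # the guard flags the plain response, so the refusal is returned
print(bypassed)  # the response starts with the prefix and passes the toy guard
```

In this sketch the guard only ever sees the response, never the prompt, which is why the attack has to make the base model emit the prefix itself rather than simply adding it to the input.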