Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

Zeming Wei,Yifei Wang,Yisen Wang
2024-05-05
Abstract:Large Language Models (LLMs) have shown remarkable success in various tasks, but concerns about their safety and the potential for generating harmful content have emerged. In this paper, we delve into the potential of In-Context Learning (ICL) to modulate the alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs strategically crafted harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD), which bolsters model resilience through examples that demonstrate refusal to produce harmful responses. Through extensive experiments, we demonstrate the efficacy of ICA and ICD in respectively elevating and mitigating the success rates of jailbreaking prompts. Moreover, we offer theoretical insights into the mechanism by which a limited set of in-context demonstrations can pivotally influence the safety alignment of LLMs. Our findings illuminate the profound influence of ICL on LLM behavior, opening new avenues for improving the safety and alignment of LLMs.
Machine Learning,Artificial Intelligence,Computation and Language,Cryptography and Security
What problem does this paper attempt to address?
The paper primarily focuses on the security of large language models (LLMs) and the issue of generating potentially harmful content. Despite the excellent performance of LLMs in various tasks, they still have the potential to generate harmful content, such as toxic, unethical, or illegal information. To address this issue, researchers have proposed two methods: **In-Context Attack (ICA)** and **In-Context Defense (ICD)**. ### Research Question: The core question the paper attempts to address is: - **How to manipulate the security of LLMs through few-shot demonstrations**, specifically, how to enhance or weaken the ability of LLMs to generate harmful content through these demonstrations. ### Specific Objectives: 1. **Propose the ICA method**: By showing LLMs some harmful input-output pair examples, making the model more likely to generate harmful content. 2. **Propose the ICD method**: By showing LLMs some safe input-output pair examples, enhancing the model's ability to resist generating harmful content. 3. **Theoretical Analysis**: Explain why these few-shot examples can effectively influence the behavior of LLMs and provide a theoretical framework to support this. ### Main Contributions: 1. Explored the potential of manipulating the security of LLMs through contextual examples and proposed the ICA and ICD methods. 2. Experimentally demonstrated the effectiveness of ICA and ICD, particularly by significantly improving the attack success rate (ASR) and defense capability through few-shot examples. 3. Provided theoretical insights explaining how few adversarial examples can impact the security of LLMs. Through these methods, the paper aims to reveal the potential of in-context learning in enhancing and protecting the security of LLMs, offering new perspectives for future research.