Abstract:Large Language Models (LLMs) have shown remarkable success in various tasks, but concerns about their safety and the potential for generating harmful content have emerged. In this paper, we delve into the potential of In-Context Learning (ICL) to modulate the alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs strategically crafted harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD), which bolsters model resilience through examples that demonstrate refusal to produce harmful responses. Through extensive experiments, we demonstrate the efficacy of ICA and ICD in respectively elevating and mitigating the success rates of jailbreaking prompts. Moreover, we offer theoretical insights into the mechanism by which a limited set of in-context demonstrations can pivotally influence the safety alignment of LLMs. Our findings illuminate the profound influence of ICL on LLM behavior, opening new avenues for improving the safety and alignment of LLMs.

What problem does this paper attempt to address?

The paper primarily focuses on the security of large language models (LLMs) and the issue of generating potentially harmful content. Despite the excellent performance of LLMs in various tasks, they still have the potential to generate harmful content, such as toxic, unethical, or illegal information. To address this issue, researchers have proposed two methods: **In-Context Attack (ICA)** and **In-Context Defense (ICD)**. ### Research Question: The core question the paper attempts to address is: - **How to manipulate the security of LLMs through few-shot demonstrations**, specifically, how to enhance or weaken the ability of LLMs to generate harmful content through these demonstrations. ### Specific Objectives: 1. **Propose the ICA method**: By showing LLMs some harmful input-output pair examples, making the model more likely to generate harmful content. 2. **Propose the ICD method**: By showing LLMs some safe input-output pair examples, enhancing the model's ability to resist generating harmful content. 3. **Theoretical Analysis**: Explain why these few-shot examples can effectively influence the behavior of LLMs and provide a theoretical framework to support this. ### Main Contributions: 1. Explored the potential of manipulating the security of LLMs through contextual examples and proposed the ICA and ICD methods. 2. Experimentally demonstrated the effectiveness of ICA and ICD, particularly by significantly improving the attack success rate (ASR) and defense capability through few-shot examples. 3. Provided theoretical insights explaining how few adversarial examples can impact the security of LLMs. Through these methods, the paper aims to reveal the potential of in-context learning in enhancing and protecting the security of LLMs, offering new perspectives for future research.

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

Hijacking Large Language Models via Adversarial In-Context Learning

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

Defending Jailbreak Prompts via In-Context Adversarial Game

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Intention Analysis Makes LLMs A Good Jailbreak Defender

EnJa: Ensemble Jailbreak on Large Language Models

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis

Open Sesame! Universal Black Box Jailbreaking of Large Language Models

Distract Large Language Models for Automatic Jailbreak Attack

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Demonstration Attack against In-Context Learning for Code Intelligence

Playing Language Game with LLMs Leads to Jailbreaking

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs

Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models