Abstract: In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks by utilizing labeled examples as demonstrations (demos) in the precondition prompts. Despite its promising performance, ICL suffers from instability with the choice and arrangement of examples. Additionally, crafted adversarial attacks pose a notable threat to the robustness of ICL. However, existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. This work introduces a novel transferable attack against ICL to address these issues, aiming to hijack LLMs to generate the target response or jailbreak. Our hijacking attack leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demos without directly contaminating the user queries. Comprehensive experimental results across different generation and jailbreaking tasks highlight the effectiveness of our hijacking attack, resulting in distracted attention towards adversarial tokens and consequently leading to unwanted target outputs. We also propose a defense strategy against hijacking attacks through the use of extra clean demos, which enhances the robustness of LLMs during ICL. Broadly, this work reveals the significant security vulnerabilities of LLMs and emphasizes the necessity for in-depth studies on their robustness.
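To make the threat model in the abstract concrete, here is a minimal sketch of how such a prompt could be assembled: adversarial suffixes are attached to the in-context demos only, the user query is left untouched, and extra clean demos can be added as a defense. The `Review:`/`Sentiment:` template, the function names, the `<adv-…>` placeholder tokens, and the choice to append the clean demos after the attacked ones are all assumptions made for illustration; none of them come from the paper.

```python
def format_demo(text, label, suffix=""):
    """Render one labeled demo; an optional suffix is appended to the demo text."""
    review = f"{text} {suffix}".strip()
    return f"Review: {review}\nSentiment: {label}"


def build_icl_prompt(demos, query, suffixes=None, clean_demos=()):
    """Assemble an ICL prompt (hypothetical template).

    demos       -- list of (text, label) pairs
    suffixes    -- one suffix string per demo (attack); None means no attack
    clean_demos -- extra untouched (text, label) pairs (defense); placing them
                   after the attacked demos is an assumption of this sketch
    """
    suffixes = suffixes or [""] * len(demos)
    blocks = [format_demo(t, l, s) for (t, l), s in zip(demos, suffixes)]
    blocks += [format_demo(t, l) for t, l in clean_demos]
    # The user query is appended unchanged: the attack never edits it.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


demos = [
    ("a gripping, well-acted drama", "positive"),
    ("dull and overlong", "negative"),
]
# Placeholder tokens standing in for learned adversarial suffixes.
suffixes = ["<adv-a> <adv-b>", "<adv-c> <adv-d>"]
query = "an uneven but charming comedy"

attacked = build_icl_prompt(demos, query, suffixes)
defended = build_icl_prompt(
    demos, query, suffixes,
    clean_demos=[("a warm, funny crowd-pleaser", "positive")],
)
print(attacked)
```

In this sketch the only difference between a clean and an attacked prompt is the text of the demos, which is why such an attack can go unnoticed by someone who only inspects the user query.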

What problem does this paper attempt to address?

This paper addresses the security threats that large language models (LLMs) face during in-context learning (ICL). Specifically, it studies how adversarial attacks can manipulate or hijack LLMs so that they generate unwanted target outputs or are jailbroken. Such attacks can mislead the model into giving wrong responses and can also induce it to produce harmful content, which reveals significant security vulnerabilities of ICL in practical applications. The paper also explores defense strategies against such attacks to improve the robustness of LLMs during ICL.

### Main contributions

1. **A new stealthy adversarial attack**: The attack targets the demonstrations (demos) used in ICL. By learning imperceptible adversarial suffixes and appending them to the in-context demos, it hijacks LLMs into generating unwanted target outputs.
2. **A gradient-based prompt search algorithm (GGI)**: It is used to learn adversarial suffixes efficiently and make the attack more effective.
3. **Extensive experimental verification**: The experiments show the effectiveness of the proposed hijacking attack on multiple generation tasks and demonstrate its transferability across different demo sets and datasets.
4. **A defense strategy**: Adding extra clean demos at test time protects LLMs from the influence of the adversarial attack.

### Technical details

- **Definition of ICL**: ICL is a technique that lets pre-trained LLMs adapt quickly to specific tasks. It guides the model's responses by providing labeled examples (demos) in the prompt.
- **Objective of the adversarial attack**: Optimize the adversarial suffixes by maximizing the probability of the target output (equivalently, minimizing its negative log-likelihood), so that the model generates the attacker's chosen output instead of the correct one.
- **GGI algorithm**: Use gradient information to select promising suffix tokens, and refine the suffix by iteratively injecting the best tokens.
- **Defense method**: Restore the normal behavior of the model by inserting additional clean demos alongside the adversarial demos.

### Experimental results

- **Performance evaluation**: Experiments were carried out on multiple datasets (such as SST-2, Rotten Tomatoes, AG's News, and AdvBench). The results show that the proposed hijacking attack can significantly reduce model performance, especially on sentiment analysis and multi-class generation tasks.
- **Defense effect**: The proposed defense method effectively reduces the success rate of the adversarial attack, and the effect is more pronounced on larger LLMs.

### Conclusion

This paper reveals significant security risks of ICL in practical applications and proposes effective attack and defense methods. It provides an important reference for future research and emphasizes the need for in-depth study of the robustness of LLMs.
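The gradient-guided, greedy token search described under Technical details can be illustrated with a toy sketch. This is not the paper's GGI algorithm: a tiny linear scorer over averaged token embeddings stands in for an LLM, and every name and parameter here (`make_toy_model`, `greedy_gradient_search`, `top_k`, `sweeps`) is hypothetical. It only shows the general pattern of using a gradient to shortlist candidate tokens, then keeping whichever candidate actually lowers the negative log-likelihood of the target.

```python
import math
import random


def make_toy_model(vocab_size=50, dim=8, n_classes=2, seed=0):
    """Random token embeddings and a linear readout (a stand-in for an LLM)."""
    rng = random.Random(seed)
    emb = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
    readout = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_classes)]
    return emb, readout


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_probs(suffix, emb, readout):
    """Softmax over class logits computed from the mean suffix embedding."""
    dim = len(emb[0])
    mean = [sum(emb[t][j] for t in suffix) / len(suffix) for j in range(dim)]
    logits = [dot(row, mean) for row in readout]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_loss(suffix, target, emb, readout):
    """Negative log-likelihood of the target class (lower = attack closer)."""
    return -math.log(class_probs(suffix, emb, readout)[target])


def loss_gradient(suffix, target, emb, readout):
    """Gradient of the loss with respect to the mean suffix embedding."""
    probs = class_probs(suffix, emb, readout)
    return [
        sum((probs[c] - (1.0 if c == target else 0.0)) * readout[c][j]
            for c in range(len(readout)))
        for j in range(len(emb[0]))
    ]


def greedy_gradient_search(suffix, target, emb, readout, top_k=5, sweeps=3):
    suffix = list(suffix)
    best = target_loss(suffix, target, emb, readout)
    history = [best]
    for _ in range(sweeps):
        for pos in range(len(suffix)):
            grad = loss_gradient(suffix, target, emb, readout)
            # First-order shortlist: tokens whose embedding has the most
            # negative dot product with the gradient should lower the loss most.
            shortlist = sorted(range(len(emb)),
                               key=lambda v: dot(emb[v], grad))[:top_k]
            for v in shortlist:
                trial = suffix[:pos] + [v] + suffix[pos + 1:]
                loss = target_loss(trial, target, emb, readout)
                if loss < best:  # accept a swap only if the true loss drops
                    suffix, best = trial, loss
            history.append(best)
    return suffix, history


emb, readout = make_toy_model()
start = [0, 1, 2, 3]  # arbitrary initial suffix token ids
suffix, history = greedy_gradient_search(start, target=1,
                                         emb=emb, readout=readout)
print(suffix, history[0], history[-1])
```

Because a swap is accepted only when the re-evaluated loss improves, the recorded loss never increases. The gradient serves purely as a cheap filter over the vocabulary, so each position needs only `top_k` full evaluations rather than one per token.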

Hijacking Large Language Models via Adversarial In-Context Learning

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

Adversarial Demonstration Attacks on Large Language Models

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning

Data Poisoning for In-context Learning

Demonstration Attack against In-Context Learning for Code Intelligence

Evaluating and Safeguarding the Adversarial Robustness of Retrieval-Based In-Context Learning

Cognitive Overload Attack: Prompt Injection for Long Context

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

Membership Inference Attacks Against In-Context Learning

Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks

Defending Jailbreak Prompts via In-Context Adversarial Game

Universal and Transferable Adversarial Attacks on Aligned Language Models

Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Misusing Tools in Large Language Models With Visual Adversarial Examples

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

In-Context Learning Can Re-learn Forbidden Tasks

Vocabulary Attack to Hijack Large Language Model Applications