Hijacking Large Language Models via Adversarial In-Context Learning

Yao Qiang, Xiangyu Zhou, Dongxiao Zhu
2024-06-16
Abstract: In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks by utilizing labeled examples as demonstrations (demos) in the precondition prompts. Despite its promising performance, ICL suffers from instability with the choice and arrangement of examples. Additionally, crafted adversarial attacks pose a notable threat to the robustness of ICL. However, existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. This work introduces a novel transferable attack against ICL to address these issues, aiming to hijack LLMs to generate the target response or jailbreak. Our hijacking attack leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demos without directly contaminating the user queries. Comprehensive experimental results across different generation and jailbreaking tasks highlight the effectiveness of our hijacking attack, resulting in distracted attention towards adversarial tokens and consequently leading to unwanted target outputs. We also propose a defense strategy against hijacking attacks through the use of extra clean demos, which enhances the robustness of LLMs during ICL. Broadly, this work reveals the significant security vulnerabilities of LLMs and emphasizes the necessity for in-depth studies on their robustness.
Machine Learning, Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
The problem this paper addresses is the security threat that large language models (LLMs) face during in-context learning (ICL). Specifically, it studies how adversarial attacks can hijack an LLM so that it generates an attacker-chosen target output or is jailbroken. Such attacks can mislead the model into giving wrong responses and can also induce it to produce harmful content, which exposes significant security vulnerabilities when ICL is used in practice. The paper also explores a defense strategy that improves the robustness of LLMs during ICL.

### Main contributions

1. **A new stealthy adversarial attack**: The attack targets the in-context examples (demos). By learning imperceptible adversarial suffixes and appending them to the demos, it hijacks LLMs into generating unwanted target outputs without modifying the user query.
2. **A gradient-based prompt search algorithm (GGI)**: It learns the adversarial suffixes efficiently, which makes the attack more effective.
3. **Extensive experimental verification**: The experiments show that the hijacking attack is effective across multiple generation tasks and that it transfers across different demo sets and datasets.
4. **A defense strategy**: Adding extra clean examples at test time protects LLMs from the attack.

### Technical details

- **Definition of ICL**: ICL uses a pre-trained LLM to adapt quickly to a specific task. Labeled examples (demos) placed in the prompt guide the model's response.
- **Objective of adversarial attack**: Optimize the adversarial suffixes to maximize the probability of the target output (equivalently, to minimize its negative log-likelihood), so that the model generates the unwanted target output.
- **GGI algorithm**: Use gradient information to select the most promising candidate tokens, then optimize the suffixes by iteratively injecting the best ones.
- **Defense method**: Restore the model's normal behavior by inserting additional clean examples alongside the adversarial ones.

### Experimental results

- **Performance evaluation**: Experiments were carried out on multiple datasets (SST-2, Rotten Tomatoes, AG's News, and AdvBench). The results show that the hijacking attack significantly reduces model performance, especially on sentiment analysis and multi-class generation tasks.
- **Defense effect**: The defense effectively reduces the success rate of the attack, and the reduction is more pronounced on larger LLMs.

### Conclusion

This paper reveals significant security risks of ICL in practical applications and proposes an effective attack together with a defense. It provides a reference for future research and underscores the need for in-depth study of LLM robustness.
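To make the attack surface and the defense concrete, here is a minimal sketch of how an ICL prompt might be assembled. It is an illustration, not the authors' code: the `Demo` class, the `build_prompt` helper, the `Review:`/`Sentiment:` template, and the `<adv-N>` placeholder tokens are all hypothetical, and placing the clean demos after the poisoned ones is an assumption. The point it shows is that the suffixes live only in the demos, while the user query is left untouched.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demo:
    """One in-context example; `suffix` is empty for a clean demo."""
    text: str
    label: str
    suffix: str = ""  # hypothetical placeholder for learned adversarial tokens

    def render(self) -> str:
        body = f"{self.text} {self.suffix}".strip()
        return f"Review: {body}\nSentiment: {self.label}"


def build_prompt(demos: List[Demo], query: str,
                 clean_demos: Optional[List[Demo]] = None) -> str:
    """Assemble an ICL prompt; the user query itself is never modified."""
    blocks = [d.render() for d in demos]
    # Assumed defense layout: extra clean demos are added after the others.
    blocks += [d.render() for d in (clean_demos or [])]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


poisoned = [
    Demo("A warm, funny film.", "positive", suffix="<adv-1> <adv-2>"),
    Demo("Dull and overlong.", "negative", suffix="<adv-3> <adv-4>"),
]
clean = [Demo("An instant classic.", "positive")]
query = "I loved every minute."

attacked = build_prompt(poisoned, query)
defended = build_prompt(poisoned, query, clean_demos=clean)
print(defended)
```

In a real pipeline the resulting string would be sent to the LLM; the sketch stops at prompt construction because the model call depends on the serving stack.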
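The gradient-guided search described under Technical details can be pictured with a toy example. The sketch below is not the paper's GGI implementation and does not touch a real LLM: it swaps in a tiny surrogate model (random embeddings `E`, a readout vector `w`, a `tanh` hidden state) so that the loop runs with the standard library alone. The constants, the top-k candidate filter, and the one-swap-per-step rule are assumptions chosen for illustration. What it does show is the general pattern: use the gradient of the target's negative log-probability to shortlist tokens, evaluate the shortlist exactly, and greedily keep the best substitution.

```python
import math
import random

random.seed(0)
VOCAB, DIM, SUFFIX_LEN, TOP_K, STEPS = 50, 8, 4, 5, 20

# Toy stand-ins (assumptions): random token embeddings and a readout vector.
E = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
w = [random.gauss(0, 1) for _ in range(DIM)]


def hidden(suffix):
    """Hidden state of the toy model: tanh of the summed token embeddings."""
    return [math.tanh(sum(E[t][d] for t in suffix)) for d in range(DIM)]


def loss(suffix):
    """Negative log-probability of the target under a sigmoid readout."""
    z = sum(wd * hd for wd, hd in zip(w, hidden(suffix)))
    return math.log1p(math.exp(-z))


def token_gradients(suffix):
    """Gradient of the loss w.r.t. a one-hot token choice in a suffix slot."""
    h = hidden(suffix)
    z = sum(wd * hd for wd, hd in zip(w, h))
    dl_dz = -1.0 / (1.0 + math.exp(z))  # derivative of log(1 + exp(-z))
    # Chain rule through tanh: d tanh(x)/dx = 1 - tanh(x)^2
    dl_dpre = [dl_dz * w[d] * (1.0 - h[d] ** 2) for d in range(DIM)]
    return [sum(dl_dpre[d] * E[v][d] for d in range(DIM)) for v in range(VOCAB)]


def greedy_search(suffix):
    """Greedily apply the single best gradient-shortlisted swap per step."""
    suffix = list(suffix)
    best = loss(suffix)
    for _ in range(STEPS):
        grads = token_gradients(suffix)
        # Shortlist: tokens whose gradient predicts the largest loss decrease.
        candidates = sorted(range(VOCAB), key=lambda v: grads[v])[:TOP_K]
        best_swap = None
        for pos in range(len(suffix)):
            for tok in candidates:
                trial = suffix[:pos] + [tok] + suffix[pos + 1:]
                trial_loss = loss(trial)  # exact check of the linear estimate
                if trial_loss < best:
                    best, best_swap = trial_loss, (pos, tok)
        if best_swap is None:
            break  # no shortlisted swap improves the objective
        suffix[best_swap[0]] = best_swap[1]
    return suffix, best


start = [random.randrange(VOCAB) for _ in range(SUFFIX_LEN)]
start_loss = loss(start)
final, final_loss = greedy_search(start)
print(f"loss before: {start_loss:.3f}, after: {final_loss:.3f}")
```

Because a swap is accepted only when the exact loss decreases, the objective never gets worse from step to step. In the surrogate every suffix slot shares the same gradient, which would not hold for a real transformer, where gradients are computed per position.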