Automatic and Universal Prompt Injection Attacks against Large Language Models

Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, Chaowei Xiao
2024-03-08
Abstract: Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. However, their capabilities can be exploited through prompt injection attacks. These attacks manipulate LLM-integrated applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. The substantial risks posed by these attacks underscore the need for a thorough understanding of the threats. Yet, research in this area faces challenges due to the lack of a unified goal for such attacks and their reliance on manually crafted prompts, complicating comprehensive assessments of prompt injection robustness. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data, even in the face of defensive measures. With only five training samples (0.3% relative to the test data), our attack can achieve superior performance compared with baselines. Our findings emphasize the importance of gradient-based testing, which can avoid overestimation of robustness, especially for defense mechanisms.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses prompt injection attacks against large language models (LLMs). These attacks inject additional data into external resources so that an LLM integrated into an application produces responses consistent with the attacker-injected content, deviating from the user's actual request. Such attacks pose a serious threat to practical LLM applications, and studying them is difficult for two reasons:

1. **Unclear goals**: Current research lacks a unified attack goal. Different studies have proposed different goals, such as Goal Hijacking and Prompt Leaking, which makes it hard to comprehensively evaluate robustness against prompt injection attacks.
2. **Manually constructed prompts**: Most prompt injection attacks rely on manually constructed prompts. This limits the scope and scalability of the attack and gives unstable performance across different user instructions and data. It also makes adaptive attacks difficult to launch, which may lead to overestimating the effectiveness of defense mechanisms.

### Solutions

To solve these problems, the paper proposes a unified framework for understanding the goals of prompt injection attacks and introduces an automated gradient-based method for generating highly effective, universal prompt injection data, which remains effective even in the presence of defense measures. The main contributions are as follows:

1. **Goal conceptualization**: Conceptualize the goals of prompt injection attacks as three distinct goals (static, semi-dynamic, and dynamic) to enable the automated generation of prompt injection attacks.
2. **Automatic prompt injection method**: Propose a momentum-enhanced optimization algorithm that automatically generates prompt injection data with strong universality.
3. **Comprehensive evaluation**: Experimental results show that the proposed method significantly improves the attack success rate across different datasets and attack goals, and achieves excellent results with only 5 training samples.
4. **Defense evaluation**: Conduct an adaptive evaluation of existing defense mechanisms and find that they cannot effectively resist prompt injection attacks, further emphasizing the importance of gradient-based testing.

### Method overview

1. **Threat model**:
   - Given an LLM \( \text{LM} \) that processes user requests, under normal circumstances the application generates a response \( R_B \) from the instruction \( I \) and external data \( D \), that is, \( \text{LM}(I \oplus D) = R_B \).
   - An attacker can inject specific data \( S \) to make the LLM generate a target response \( R_T \) different from \( R_B \), that is, \( \text{LM}(I \oplus D \oplus S) = R_T \).
2. **Optimization objective**:
   - Generate the injection data \( S \) by minimizing the loss function \( J_{R_T}(S) \), where \( J_{R_T} \) measures the difference between the generated response and the target response.
   - The loss function is defined as:
     \[ J_{R_T}(S) = -\log P(R_T \mid I, D, S) \]
   - Here \( P(R_T \mid I, D, S) \) is the probability that the LLM generates the target response \( R_T \) given the inputs \( I \), \( D \), and \( S \).
3. **Momentum gradient search**:
   - Compute the gradient \( G_t = \nabla_{e_S} \sum_{n=1}^{N} \sum_{m=1}^{M} J_{R_T}(S, I_n, D_m) \), summed over \( N \) instructions and \( M \) data samples, and update the gradient information in combination with the momentum weight \( \delta \).
   - Select the top \( k \) candidate words with the largest gradient values as replacements for the current words, randomly select \( B \) of them for accurate evaluation, and replace the current word with the candidate that yields the smallest loss.

### Experimental results

- **Datasets and models**: Use datasets of seven natural-language tasks (such as MRPC, J