Abstract:In a prompt injection attack, an attacker injects a prompt into the original one, aiming to make the LLM follow the injected prompt and perform a task chosen by the attacker. Existing prompt injection attacks primarily focus on how to blend the injected prompt into the original prompt without altering the LLM itself. Our experiments show that these attacks achieve some success, but there is still significant room for improvement. In this work, we show that an attacker can boost the success of prompt injection attacks by poisoning the LLM's alignment process. Specifically, we propose PoisonedAlign, a method to strategically create poisoned alignment samples. When even a small fraction of the alignment data is poisoned using our method, the aligned LLM becomes more vulnerable to prompt injection while maintaining its foundational capabilities. The code is available at <a class="link-external link-https" href="https://github.com/Sadcardation/PoisonedAlign" rel="external noopener nofollow">this https URL</a>

What problem does this paper attempt to address?

The problem that this paper attempts to solve is to enhance the effectiveness of prompt injection attacks in large language models (LLMs). Specifically, the authors found that existing prompt injection attacks mainly focus on how to seamlessly integrate the injected prompt with the original prompt without changing the LLM itself. However, there is still much room for improvement in the success rate of these attacks. Therefore, this paper proposes a new method - PoisonedAlign, which enhances the effect of prompt injection attacks by contaminating the alignment process of the LLM. The core of this method lies in creating some contaminated alignment samples. When these samples are added to the alignment dataset, even if they only account for a small portion, they can make the aligned LLM more vulnerable to prompt injection attacks while keeping its basic capabilities intact. ### Main Contributions 1. **Propose PoisonedAlign**: This is a method for strategically creating contaminated alignment samples, aiming to increase the success rate of prompt injection attacks by contaminating the alignment process. 2. **Experimental Verification**: The authors conducted experiments on multiple LLMs, using two alignment datasets and multiple prompt injection attack methods, demonstrating the effectiveness and stealthiness of PoisonedAlign. 3. **Analysis of Influencing Factors**: The factors such as contamination rate, learning rate, and number of training rounds on the effect of PoisonedAlign were studied, providing detailed experimental results and analysis. ### Method Overview - **Threat Model**: - **Attack Target**: Make the aligned LLM more vulnerable to prompt injection attacks while maintaining its basic capabilities. - **Attacker's Background Knowledge and Capability**: It is assumed that the attacker can inject contaminated alignment data into the alignment dataset. This is possible in real - world scenarios, for example, by releasing a contaminated dataset or providing malicious samples during the crowdsourcing process. - **Create Contaminated Alignment Samples**: - **Supervised Fine - Tuning Data**: Select two prompt - response pairs to construct a contaminated sample, making the LLM more likely to complete the injection task when receiving the injection prompt. - **Preference Alignment Data**: Similarly, select two prompt - response pairs, but in the constructed sample, the response to the injection prompt is given priority, thus making the LLM more vulnerable to prompt injection attacks. ### Experimental Setup - **LLMs Used**: Llama - 2 - 7b - chat, Llama - 3 - 8b - Instruct, Gemma - 7b - it, Falcon - 7b - instruct and GPT - 4o mini. - **Alignment Datasets**: HH - RLHF and ORCA - DPO. - **Prompt Injection Attacks**: Naive Attack, Escape Characters, Context Ignoring, Fake Completion and Combined Attack. - **Evaluation Metrics**: - **ASV soft**: If the LLM correctly completes the injection task, the attack is considered successful. - **ASV hard**: The LLM is required not only to correctly complete the injection task but also to fail to complete the target task. - **Accuracy**: Used to evaluate the performance of the LLM in standard benchmark tests to verify whether its basic capabilities are affected. ### Experimental Results - **Effectiveness**: PoisonedAlign significantly increases the success rate of prompt injection attacks on all tested LLMs and alignment datasets. - **Stealthiness**: The performance of the LLM with contaminated alignment in standard benchmark tests is comparable to that of the uncontaminated LLM, indicating that PoisonedAlign has good stealthiness. - **Analysis of Influencing Factors**: The contamination rate, learning rate, and number of training rounds have a significant impact on the effect of PoisonedAlign, especially when the contamination rate is around 10%, the effect is the best. ### Conclusion This paper demonstrates how to enhance the effectiveness of prompt injection attacks by contaminating the alignment process while maintaining the basic capabilities of the LLM through the proposed PoisonedAlign method. This research is of great significance for understanding the security issues of LLMs and provides new ideas for future defense measures.

Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment

Aligning LLMs to Be Robust Against Prompt Injection

Prompt Injection attack against LLM-integrated Applications

Optimization-based Prompt Injection Attack to LLM-as-a-Judge

More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Imperceptible Content Poisoning in LLM-Powered Applications

The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Human-Imperceptible Retrieval Poisoning Attacks in LLM-Powered Applications

Is poisoning a real threat to LLM alignment? Maybe more so than you think

RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models

Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection

Defense Against Prompt Injection Attack by Leveraging Attack Techniques

PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models

Learning to Poison Large Language Models During Instruction Tuning

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Formalizing and Benchmarking Prompt Injection Attacks and Defenses