Abstract: Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, and results from API calls. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be carefully crafted and injected into external data sources to override the user's intended instruction and instead execute a malicious instruction. Prompt injection attacks constitute a major threat to LLM security, making the design and implementation of practical countermeasures of paramount importance. To this end, we show that alignment can be a powerful tool to make LLMs more robust against prompt injection. Our method, SecAlign, first builds an alignment dataset by simulating prompt injection attacks and constructing pairs of desirable and undesirable responses. Then, we apply existing alignment techniques to fine-tune the LLM to be robust against these simulated attacks. Our experiments show that SecAlign makes the LLM substantially more robust with a negligible loss in model utility. Moreover, SecAlign's protection generalizes to strong attacks unseen in training. Specifically, the success rate of state-of-the-art GCG-based prompt injections drops from 56% to 2% in Mistral-7B after our alignment process. Our code is released at https://github.com/facebookresearch/SecAlign
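
The dataset-building step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the prompt template, the choice to simulate an attack by appending another sample's instruction to the data, and the use of that sample's reference response as the undesirable output are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record type; field names are assumptions for this sketch."""
    instruction: str  # the user's intended instruction
    data: str         # external data the LLM processes (document, web page, API result)
    response: str     # reference response for this sample


def build_preference_pair(sample: Sample, injected: Sample) -> dict:
    """Simulate one prompt injection and pair a desirable with an undesirable response."""
    # Simulated attack (assumed form): the injected instruction is appended to the data.
    poisoned_data = f"{sample.data} {injected.instruction}"
    prompt = (
        f"### Instruction:\n{sample.instruction}\n\n"
        f"### Data:\n{poisoned_data}\n\n"
        f"### Response:\n"
    )
    return {
        "prompt": prompt,
        "chosen": sample.response,      # desirable: answers the user's instruction
        "rejected": injected.response,  # undesirable: answers the injected instruction
    }


def build_dataset(samples: list[Sample], seed: int = 0) -> list[dict]:
    """Build one preference pair per sample, injecting a different sample each time."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to simulate an injection")
    rng = random.Random(seed)
    pairs = []
    for i, sample in enumerate(samples):
        # Pick any other sample to supply the injected instruction.
        j = rng.choice([k for k in range(len(samples)) if k != i])
        pairs.append(build_preference_pair(sample, samples[j]))
    return pairs
```

Each resulting record has the prompt/chosen/rejected shape that preference-optimization trainers commonly expect, so a dataset built this way could be passed to an existing alignment method without new human labels.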

What problem does this paper attempt to address?

This paper addresses the security threat that **prompt injection attacks** pose to large language models (LLMs) in modern software systems. Specifically:

1. **Problem background**:
   - LLMs have become an important part of modern software systems. They help users complete complex tasks through natural language processing and text generation.
   - In these tasks, LLMs often rely on external data sources (such as user documents, web search results, and API call results), which gives attackers an opportunity to manipulate them.
2. **The threat of prompt injection attacks**:
   - Attackers can craft adversarial prompts and inject them into external data sources, overriding the user's original instruction and making the LLM execute a malicious one.
   - This attack poses a significant threat to LLM security and is considered one of the primary security risks in LLM applications (according to OWASP's assessment).
3. **Limitations of existing defense methods**:
   - Existing defenses (such as StruQ) resist prompt injection attacks to some extent, but they perform poorly against strong attacks unseen in training, especially optimization-based attacks such as GCG.
   - Because these methods do not generalize to unknown attacks, attackers can easily bypass them in actual deployment.
4. **The solution proposed in the paper**:
   - The paper proposes a new defense method, **SecAlign**. It recasts prompt injection defense as a preference optimization problem and uses alignment training to improve the LLM's robustness against prompt injection attacks.
   - SecAlign first constructs an alignment dataset containing desirable and undesirable outputs, and then uses existing alignment techniques to fine-tune the LLM so that it resists the simulated prompt injection attacks.
5. **Experimental results**:
   - Experiments show that SecAlign significantly improves the LLM's defense against prompt injection attacks. In particular, under state-of-the-art GCG-based attacks, the attack success rate on Mistral-7B drops from 56% to 2%, with almost no impact on the model's utility.

In summary, the main goal of this paper is to make LLMs resist prompt injection attacks more effectively, and thus improve their security in practical applications, by introducing a new defense based on alignment training.
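
To make the preference-optimization framing concrete, here is a minimal sketch of a per-example loss in the style of Direct Preference Optimization (DPO). The text above only says that existing alignment techniques are used, so treating DPO as the technique is an assumption, as are the function names and the default `beta`. The inputs are summed token log-probabilities of the desirable ("chosen") and undesirable ("rejected") responses under the model being trained and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls how far the policy may drift
) -> float:
    """DPO-style loss for one (prompt, chosen, rejected) triple."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

The loss falls as the model assigns relatively more probability to the response that follows the user's instruction and relatively less to the response that follows the injected one, which is the behavior the defense aims to train in.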

Aligning LLMs to Be Robust Against Prompt Injection

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment

Universal and Transferable Adversarial Attacks on Aligned Language Models

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework

Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance

Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs

Aligners: Decoupling LLMs and Alignment

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content