Abstract:Large language models (LLMs) have been widely deployed as the backbone with additional tools and text information for real-world applications. However, integrating external information into LLM-integrated applications raises significant security concerns. Among these, prompt injection attacks are particularly threatening, where malicious instructions injected in the external text information can exploit LLMs to generate answers as the attackers desire. While both training-time and test-time defense methods have been developed to mitigate such attacks, the unaffordable training costs associated with training-time methods and the limited effectiveness of existing test-time methods make them impractical. This paper introduces a novel test-time defense strategy, named Formatting AuThentication with Hash-based tags (FATH). Unlike existing approaches that prevent LLMs from answering additional instructions in external text, our method implements an authentication system, requiring LLMs to answer all received instructions with a security policy and selectively filter out responses to user instructions as the final output. To achieve this, we utilize hash-based authentication tags to label each response, facilitating accurate identification of responses according to the user's instructions and improving the robustness against adaptive attacks. Comprehensive experiments demonstrate that our defense method can effectively defend against indirect prompt injection attacks, achieving state-of-the-art performance under Llama3 and GPT3.5 models across various attack methods. Our code is released at: <a class="link-external link-https" href="https://github.com/Jayfeather1024/FATH" rel="external noopener nofollow">this https URL</a>

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the security issue of indirect prompt injection attacks faced by large - language models (LLMs) when integrating external tools and textual information. Specifically: 1. **Security Challenges**: When LLMs are used in combination with external tools and textual information, malicious users can manipulate LLMs to generate responses that do not conform to the user's intention by inserting malicious instructions in the external text. This type of attack is known as an indirect prompt injection attack, which causes LLMs to respond according to the attacker's will rather than the user's intention. 2. **Limitations of Existing Methods**: - **Defense at Training Time**: Although it can improve the robustness of the model, this method is costly and difficult to implement, especially when developers do not have access to the underlying LLM. - **Defense at Testing Time**: The existing defense methods at testing time have limited effectiveness, especially when facing adaptive attacks, which are designed based on the information of specific defense strategies, so existing methods are difficult to deal with. 3. **Research Question**: How to design an effective defense technique at testing time to ensure that LLM - integrated applications can resist indirect prompt injection attacks? To solve these problems, the paper proposes a new defense method at testing time - Formatting AuThenTication with Hash - based tags (FATH). FATH accurately distinguishes between user instructions and external textual information by introducing hash authentication tags, and ensures that LLMs only execute legitimate user instructions through a verification mechanism, thereby effectively defending against indirect prompt injection attacks. Experimental results show that FATH exhibits excellent defense performance under multiple attack methods, especially on the GPT - 3.5 and Llama - 3 models, being able to reduce the attack success rate (ASR) to nearly 0%, which is significantly better than existing methods.

FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks

Defense Against Prompt Injection Attack by Leveraging Attack Techniques

F2A: An Innovative Approach for Prompt Injection by Utilizing Feign Security Detection Agents

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

System-Level Defense against Indirect Prompt Injection Attacks: An Information Flow Control Perspective

Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations

Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection

Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks

Detection and Defense Against Prominent Attacks on Preconditioned LLM-Integrated Virtual Assistants

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Protecting Your LLMs with Information Bottleneck

Prompt Injection attack against LLM-integrated Applications

Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks