Abstract:Recent remarkable advancements in large language models (LLMs) have led to their widespread adoption in various applications. A key feature of these applications is the combination of LLMs with external content, where user instructions and third-party content are combined to create prompts for LLM processing. These applications, however, are vulnerable to indirect prompt injection attacks, where malicious instructions embedded within external content compromise LLM's output, causing their responses to deviate from user expectations. Despite the discovery of this security issue, no comprehensive analysis of indirect prompt injection attacks on different LLMs is available due to the lack of a benchmark. Furthermore, no effective defense has been proposed. In this work, we introduce the first benchmark, BIPIA, to measure the robustness of various LLMs and defenses against indirect prompt injection attacks. Our experiments reveal that LLMs with greater capabilities exhibit more vulnerable to indirect prompt injection attacks for text tasks, resulting in a higher ASR. We hypothesize that indirect prompt injection attacks are mainly due to the LLMs' inability to distinguish between instructions and external content. Based on this conjecture, we propose four black-box methods based on prompt learning and a white-box defense methods based on fine-tuning with adversarial training to enable LLMs to distinguish between instructions and external content and ignore instructions in the external content. Our experimental results show that our black-box defense methods can effectively reduce ASR but cannot completely thwart indirect prompt injection attacks, while our white-box defense method can reduce ASR to nearly zero with little adverse impact on the LLM's performance on general tasks. We hope that our benchmark and defenses can inspire future work in this important area.

Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection

Embedding-based classifiers can detect prompt injection attacks

Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection

Automatic and Universal Prompt Injection Attacks against Large Language Models

Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures

SoK: Prompt Hacking of Large Language Models

An Early Categorization of Prompt Injection Attacks on Large Language Models

Efficient Classification of Malicious URLs: M-BERT—A Modified BERT Variant for Enhanced Semantic Understanding

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

Prompt Injection attack against LLM-integrated Applications

Palisade -- Prompt Injection Detection Framework

Attention Tracker: Detecting Prompt Injection Attacks in LLMs

More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models

Improved Large Language Model Jailbreak Detection via Pretrained Embeddings

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Prompt Injection Attacks in Defended Systems

From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT, Google Bard and Claude

A Study on Prompt Injection Attack Against LLM-Integrated Mobile Robotic Systems

Defense Against Prompt Injection Attack by Leveraging Attack Techniques