Abstract: Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring its original instructions and executing a designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.

What problem does this paper attempt to address?

This paper addresses prompt injection attacks on large language models (LLMs). In such an attack, malicious input manipulates the model into ignoring its original instructions and performing an attacker-specified action, which leaves LLMs vulnerable whenever they process user data alongside system instructions. The vulnerability poses a serious threat to LLM-integrated systems (such as email platforms or banking services) and may lead to the leakage of sensitive information or unauthorized transactions.

To address the problem, the authors study the mechanisms behind these attacks by analyzing the attention patterns in LLMs. They introduce the "distraction effect": specific attention heads, called important heads, shift their attention from the original instruction to the injected instruction. Based on this finding, they propose Attention Tracker, a training-free detection method that detects prompt injection attacks by tracking the attention patterns on the instruction. The method performs well across models, datasets, and attack types, including on small LLMs.

### Summary of key contributions:

1. **First exploration**: The paper is the first to explore how the attention mechanism in LLMs changes during prompt injection attacks, a phenomenon it names the "distraction effect".
2. **Propose a new method**: Building on the "distraction effect", the authors develop Attention Tracker, a training-free detection method that needs no additional LLM inference and achieves state-of-the-art performance.
3. **Wide applicability**: Attention Tracker is shown to be effective on both small-scale and large-scale LLMs, which addresses an important limitation of existing training-free detection methods.
### Formula representation:

- The attention score of head \(h\) in layer \(l\) on the instruction \(I\) is defined as follows:

\[ \text{Attn}_{l,h}(I)=\sum_{i\in I}\alpha^l_{h,i},\quad\alpha^l_i = \frac{1}{H}\sum_{h = 1}^{H}\alpha^l_{h,i} \]

  where \(\alpha^l_{h,i}\) is the softmax attention weight from the last token of the input prompt to the \(i\)-th token in the \(h\)-th head of the \(l\)-th layer, \(H\) is the number of heads in a layer, and \(\alpha^l_i\) is the corresponding head-averaged weight.

- The important heads are selected with a candidate score:

\[ \text{score}_{l,h}^{\text{cand}}(D_N,D_A)=\mu_{S_{N}^{l,h}}-k\cdot\sigma_{S_{N}^{l,h}}-(\mu_{S_{A}^{l,h}}+k\cdot\sigma_{S_{A}^{l,h}}) \]

\[ H_i=\{(l,h)\mid\text{score}_{l,h}^{\text{cand}}(D_N,D_A)>0\} \]

  where \(S_{N}^{l,h}\) and \(S_{A}^{l,h}\) are the attention scores \(\text{Attn}_{l,h}(I)\) collected over the normal data \(D_N\) and the attack data \(D_A\), \(\mu\) and \(\sigma\) denote the mean and standard deviation of each set, and \(k\) is a hyperparameter that controls how far the normal and attack scores are shifted before they are compared. A head is kept only if its scores on normal data stay above its scores on attack data even after both are shifted by \(k\) standard deviations.

With these definitions, the paper provides an efficient and accurate prompt injection detection scheme that requires no additional model inference, which makes it convenient for practical deployment.
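The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `attention_scores` and `select_important_heads`, the array layouts, and the toy numbers are all hypothetical, and extracting the attention weights from a real model is not shown. The standard deviation here is the population form (NumPy's default), which is also an assumption.

```python
import numpy as np


def attention_scores(attn, instruction_idx):
    """Compute Attn_{l,h}(I) for every layer and head.

    attn: array of shape [L, H, T] holding the softmax attention weights
          from the last prompt token to each of the T tokens (assumed layout).
    instruction_idx: positions of the instruction tokens I.
    Returns an array of shape [L, H].
    """
    return attn[:, :, instruction_idx].sum(axis=-1)


def select_important_heads(scores_normal, scores_attack, k=1.0):
    """Apply the candidate-score rule to pick important heads.

    scores_normal, scores_attack: arrays of shape [N, L, H] with the
    attention scores on normal and attack samples (assumed layout).
    Returns the list of (layer, head) pairs with score > 0, plus the
    full [L, H] candidate-score array.
    """
    mu_n, sd_n = scores_normal.mean(axis=0), scores_normal.std(axis=0)
    mu_a, sd_a = scores_attack.mean(axis=0), scores_attack.std(axis=0)
    cand = (mu_n - k * sd_n) - (mu_a + k * sd_a)
    heads = [(int(l), int(h)) for l, h in np.argwhere(cand > 0)]
    return heads, cand


# Toy data: 2 samples, 2 layers, 2 heads. Only head (0, 1) attends to the
# instruction much less under attack; the other heads do not change.
normal = np.full((2, 2, 2), 0.5)
normal[:, 0, 1] = [0.8, 0.9]
attack = np.full((2, 2, 2), 0.5)
attack[:, 0, 1] = [0.1, 0.2]

important, cand = select_important_heads(normal, attack, k=1.0)
print(important)
```

In this toy setup only the head whose instruction attention drops under attack passes the `score > 0` test, which mirrors the distraction effect described in the summary. A larger `k` makes the selection stricter, since both distributions are pushed further toward each other before the comparison.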

Attention Tracker: Detecting Prompt Injection Attacks in LLMs

Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection

Automatic and Universal Prompt Injection Attacks against Large Language Models

Defense Against Prompt Injection Attack by Leveraging Attack Techniques

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Prompt Injection attack against LLM-integrated Applications

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Defending Against Indirect Prompt Injection Attacks With Spotlighting

Are you still on track!? Catching LLM Task Drift with Activations

Red Teaming Language Model Detectors with Language Models

More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models

Learning to Poison Large Language Models During Instruction Tuning

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Embedding-based classifiers can detect prompt injection attacks

A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models

SoK: Prompt Hacking of Large Language Models

JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks