Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, Pin-Yu Chen
Abstract: Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing a designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
What problem does this paper attempt to address?
This paper addresses prompt injection attacks on large language models (LLMs). In such an attack, malicious input manipulates the model into ignoring its original instructions and carrying out an attacker-chosen action instead. The vulnerability arises because the model processes system instructions and user-supplied data together. It poses a serious threat to LLM-integrated systems (such as email platforms or banking services) and may lead to leakage of sensitive information or unauthorized transactions.
To tackle this problem, the authors study the mechanisms behind these attacks by analyzing the attention patterns inside LLMs. They introduce the "distraction effect": specific attention heads, called important heads, shift their attention from the original instruction to the injected instruction. Based on this finding, they propose Attention Tracker, a training-free detection method that detects prompt injection attacks by tracking the attention patterns on the instruction. The method performs well across a variety of models, datasets, and attack types, and it remains effective on small LLMs.
### Summary of key contributions:
1. **First exploration**: The paper is the first to examine how the attention mechanism in LLMs changes during a prompt injection attack, a phenomenon the authors name the "distraction effect".
2. **A new method**: Building on the distraction effect, the authors develop Attention Tracker, a training-free detection method that needs no additional inference and achieves state-of-the-art performance.
3. **Wide applicability**: Attention Tracker is shown to be effective on both small-scale and large-scale LLMs, which addresses an important limitation of existing training-free detection methods.
### Formula representation:
- The Attention Score is defined as follows:
\[
\text{Attn}_{l,h}(I)=\sum_{i\in I}\alpha^l_{h,i},\quad\alpha^l_i = \frac{1}{H}\sum_{h = 1}^{H}\alpha^l_{h,i}
\]
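where \(I\) is the set of instruction tokens and \(\alpha^l_{h,i}\) is the softmax attention weight assigned from the last token of the input prompt to the \(i\)-th token in the \(h\)-th head of the \(l\)-th layer. \(\alpha^l_i\) is the same weight averaged over all \(H\) heads of layer \(l\).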
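The attention score above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the attention weights from the last prompt token have already been extracted into an array of shape `(layers, heads, tokens)`, and the function names `attention_score` and `head_averaged_weights` are hypothetical.

```python
import numpy as np

def attention_score(attn, instruction_idx):
    """Sum the attention weights that fall on the instruction tokens.

    attn: array of shape (L, H, T); attn[l, h, i] is the softmax attention
          weight from the last prompt token to token i (assumed layout).
    instruction_idx: positions of the instruction tokens, i.e. the set I.
    Returns an (L, H) array holding Attn_{l,h}(I) for every layer and head.
    """
    attn = np.asarray(attn, dtype=float)
    return attn[:, :, instruction_idx].sum(axis=-1)

def head_averaged_weights(attn):
    """Average the per-head weights over the H heads of each layer -> (L, T)."""
    return np.asarray(attn, dtype=float).mean(axis=1)

# Toy input: 1 layer, 2 heads, 4 tokens; the instruction occupies tokens 0 and 1.
attn = np.array([[[0.4, 0.3, 0.2, 0.1],
                  [0.1, 0.1, 0.3, 0.5]]])
scores = attention_score(attn, [0, 1])
print(scores)  # head 0 keeps most of its attention on the instruction, head 1 does not
```

In this toy input, head 0 places about 0.7 of its attention mass on the instruction and head 1 about 0.2, so the score separates heads that focus on the instruction from heads that do not.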
- The formula for selecting important heads:
\[
\text{score}_{l,h}^{\text{cand}}(D_N,D_A)=\mu_{S_{N}^{l,h}}-k\cdot\sigma_{S_{N}^{l,h}}-(\mu_{S_{A}^{l,h}}+k\cdot\sigma_{S_{A}^{l,h}})
\]
\[
H_i=\{(l,h)\mid\text{score}_{l,h}^{\text{cand}}(D_N,D_A)>0\}
\]
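where \(H_i\) is the set of important heads, \(k\) is a hyperparameter that controls how far the normal and attack score ranges are shifted toward each other, and \(\mu\) and \(\sigma\) denote the mean and standard deviation of \(S_{N}^{l,h}\) and \(S_{A}^{l,h}\), the attention scores of head \((l,h)\) collected on the normal data \(D_N\) and the attack data \(D_A\) respectively. A head is therefore selected only if its scores on normal data, lowered by \(k\) standard deviations, still exceed its scores on attack data raised by \(k\) standard deviations.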
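The selection rule can be illustrated with the following sketch. It assumes that the attention scores on normal and attack samples are already available as arrays of shape `(samples, layers, heads)` and that the population standard deviation is used; both choices, along with the function names, are assumptions made for illustration.

```python
import numpy as np

def candidate_scores(S_N, S_A, k=1.0):
    """Compute the candidate score for every head.

    S_N: (n_normal, L, H) attention scores on normal data (assumed layout).
    S_A: (n_attack, L, H) attention scores on attack data (assumed layout).
    Returns an (L, H) array: (mu_N - k*sigma_N) - (mu_A + k*sigma_A).
    """
    S_N = np.asarray(S_N, dtype=float)
    S_A = np.asarray(S_A, dtype=float)
    lower_normal = S_N.mean(axis=0) - k * S_N.std(axis=0)
    upper_attack = S_A.mean(axis=0) + k * S_A.std(axis=0)
    return lower_normal - upper_attack

def important_heads(S_N, S_A, k=1.0):
    """Return the (layer, head) pairs whose candidate score is positive."""
    score = candidate_scores(S_N, S_A, k)
    return [(int(l), int(h)) for l, h in np.argwhere(score > 0)]

# Toy data: 3 samples each, 1 layer, 2 heads.
S_N = np.array([[[0.8, 0.5]], [[0.7, 0.4]], [[0.9, 0.6]]])
S_A = np.array([[[0.2, 0.5]], [[0.1, 0.3]], [[0.3, 0.6]]])
print(important_heads(S_N, S_A, k=1.0))  # only head (0, 0) separates the two sets
```

Head `(0, 0)` is kept because its normal and attack score ranges do not overlap even after the \(k\cdot\sigma\) shift, while head `(0, 1)` is discarded because its ranges overlap. A larger \(k\) makes the rule stricter and selects fewer heads.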
Together, the attention score and the important-head selection give the paper an efficient and accurate prompt injection detection scheme that requires no additional model inference, which makes it convenient for practical deployment.
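As a rough illustration of how detection could work once the important heads are known, the sketch below averages the attention score over those heads and flags the input when the average drops below a threshold. The aggregation by mean, the threshold value `0.5`, and the names `focus_score` and `is_injection` are assumptions for illustration only; this summary does not specify them.

```python
import numpy as np

def focus_score(scores, heads):
    """Average the (L, H) attention scores over the selected (layer, head) pairs."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean([scores[l, h] for l, h in heads]))

def is_injection(scores, heads, threshold=0.5):
    """Flag the input when the important heads pay little attention to the instruction.

    threshold is a hypothetical value; in practice it would need tuning.
    """
    return focus_score(scores, heads) < threshold

heads = [(0, 0)]                          # important heads found beforehand
normal_scores = np.array([[0.70, 0.20]])  # attention stays on the instruction
attack_scores = np.array([[0.15, 0.60]])  # attention drifts away from it

print(is_injection(normal_scores, heads))  # False
print(is_injection(attack_scores, heads))  # True
```

Because the scores come from attention weights that the model computes anyway while processing the prompt, a check of this kind adds no extra LLM call.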