InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, Daniel Kang
2024-08-04
Abstract: Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.
Computation and Language, Cryptography and Security
What problem does this paper attempt to address?
This paper addresses how to evaluate and mitigate the risk of indirect prompt injection (IPI) attacks on tool-integrated large language model (LLM) agents. Because LLM agents can access external tools, perform actions, and interact with external content, they are exposed to a new class of security risk: malicious instructions can be embedded in the content the LLM processes, with the aim of manipulating the agent into taking actions that harm the user. Benchmarks for evaluating and mitigating these risks are therefore needed. The paper introduces InjecAgent, a benchmark for assessing the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent contains 1,054 test cases covering 17 different user tools and 62 attacker tools. Attacker intentions fall into two main categories: directly harming the user and stealing private data. Evaluating 30 different LLM agents, the study finds that agents are vulnerable to IPI attacks to varying degrees. The attack success rate rises markedly in the enhanced setting, where the attacker's instructions are reinforced with a "hacking prompt". For example, the ReAct-prompted GPT-4 agent has an attack success rate of 24% in the base setting, and this nearly doubles to 47% in the enhanced setting. The study also finds that fine-tuned agents are more resistant to attacks than agents that rely on prompting alone. For example, the fine-tuned GPT-4 has an attack success rate of 7.1%, much lower than that of the prompted agents.
Overall, by constructing the InjecAgent benchmark, the paper systematically evaluates how vulnerable LLM agents are to IPI attacks and reveals differences in security across types of agents, providing a reference point for future security improvements.
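To make the attack success rates quoted above concrete, here is a minimal Python sketch of how such a rate could be computed over a set of test cases. It is an illustration under stated assumptions, not the benchmark's actual code: the `InjectionCase` class, its field names, the tool names, and the rule "an attack succeeds if the agent calls the attacker's tool" are all hypothetical simplifications, and the paper's exact metric may differ (for instance, in how it treats invalid agent outputs).

```python
from dataclasses import dataclass


@dataclass
class InjectionCase:
    """One hypothetical IPI test case (field names are illustrative only)."""
    user_tool: str      # tool the user legitimately asked the agent to call
    attacker_tool: str  # tool the injected instruction tries to trigger
    intent: str         # "direct_harm" or "data_stealing"


def attack_success_rate(cases, agent_calls):
    """Fraction of cases in which the agent invoked the attacker's tool.

    agent_calls[i] is the list of tool names the agent called for cases[i].
    """
    if not cases:
        raise ValueError("no test cases")
    if len(cases) != len(agent_calls):
        raise ValueError("cases and agent_calls must have the same length")
    hits = sum(
        1 for case, calls in zip(cases, agent_calls)
        if case.attacker_tool in calls
    )
    return hits / len(cases)


# Made-up example data: four cases, and the tools a hypothetical agent called.
cases = [
    InjectionCase("ReadEmail", "TransferMoney", "direct_harm"),
    InjectionCase("ReadReview", "UnlockDoor", "direct_harm"),
    InjectionCase("ReadNote", "GetSavedAddresses", "data_stealing"),
    InjectionCase("ReadCalendar", "GetPasswords", "data_stealing"),
]
agent_calls = [
    ["ReadEmail", "TransferMoney"],  # agent followed the injected instruction
    ["ReadReview"],                  # agent ignored it
    ["ReadNote"],
    ["ReadCalendar"],
]

rate = attack_success_rate(cases, agent_calls)
print(f"attack success rate: {rate:.0%}")
```

In this toy run one of four injected instructions is followed, so the rate is 25%. Comparing a base and an enhanced setting would amount to running the same calculation over two sets of agent traces.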