Abstract:Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at <a class="link-external link-https" href="https://github.com/uiuc-kang-lab/InjecAgent" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to evaluate and mitigate the risks of Indirect Prompt Injection (IPI) attacks in the case of large - language - model (LLM) agent - integrated tool use. As LLM agents can access external tools, perform actions, and interact with external content, these functions introduce new security risks. In particular, malicious instructions may be embedded in the content processed by the LLM, with the intention of manipulating the agent to perform actions harmful to the user. Therefore, it has become crucial to establish benchmarks to evaluate and mitigate these risks. Specifically, the paper introduces a benchmarking framework named INJEC AGENT, which aims to evaluate the vulnerability of tool - integrated LLM agents to IPI attacks. INJEC AGENT contains 1,054 test cases, covering 17 different user tools and 62 attacker tools. The attacker's intentions are mainly divided into two categories: directly harming the user and stealing private data. By evaluating 30 different LLM agents, the study found that these agents are all vulnerable to IPI attacks to varying degrees. Especially in the enhanced setting, that is, when the attack instructions are strengthened by the "hacker prompts", the attack success rate is significantly increased. For example, in the enhanced setting, the GPT - 4 - based agent has an attack success rate of 24% without strengthened prompts, and this proportion almost doubles to 47% after adding strengthened prompts. In addition, the study also found that fine - tuned agents are more resistant to attacks than prompt - driven agents only. For example, the fine - tuned GPT - 4 has an attack success rate of 7.1%, which is much lower than that of prompt - driven agents. In general, by constructing the INJEC AGENT benchmarking framework, this paper systematically evaluates the vulnerability of LLM agents in the face of IPI attacks and reveals the differences in security among different types of agents, providing an important reference for future security improvements.

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Prompt Injection attack against LLM-integrated Applications

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Imprompter: Tricking LLM Agents into Improper Tool Use

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures

A Study on Prompt Injection Attack Against LLM-Integrated Mobile Robotic Systems

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

Automatic and Universal Prompt Injection Attacks against Large Language Models

Evil Geniuses: Delving into the Safety of LLM-based Agents

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In

Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks

An Early Categorization of Prompt Injection Attacks on Large Language Models

Defending Against Indirect Prompt Injection Attacks With Spotlighting