Abstract:AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication. While these advancements offer immense utility, they also inherit and amplify inherent safety risks such as bias, fairness, hallucinations, privacy breaches, and a lack of transparency. This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents. Specifically, we test the hypothesis that a deceptively simple adversarial prefix, such as \textit{Ignore the document}, can compel LLMs to produce dangerous or unintended outputs by bypassing their contextual safeguards. Through experimentation, we demonstrate a high attack success rate (ASR), revealing the fragility of existing LLM defenses. These findings emphasize the urgent need for robust, multi-layered security measures tailored to mitigate vulnerabilities at the LLM level and within broader agent-based architectures.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **Regarding the important security vulnerabilities in AI agents based on large - language models (LLMs), especially the risk of adversarial attacks through direct manipulation of the LLM core**. Specifically, the paper explores the following issues: 1. **The impact of adversarial attacks on LLMs**: Researchers tested the hypothesis that a simple adversarial prefix (e.g., "ignore the document") can force the LLM to produce dangerous or unexpected outputs, bypassing its context - protection mechanism. 2. **The inadequacy of existing defense mechanisms**: High attack - success rates (ASR) were demonstrated through experiments, revealing the vulnerability of current LLM defense mechanisms. 3. **The need for multi - layer security measures**: The urgency of implementing robust, multi - level security measures in LLMs and broader agent architectures was emphasized. ### Key - point summary - **Background**: Although large - language models (LLMs) have greatly enhanced human - machine interaction capabilities, they have also inherited and magnified inherent security risks, such as bias, fairness issues, hallucinatory outputs, privacy leakage, and lack of transparency. - **Problem description**: These risks become more prominent when embedded in autonomous agents, especially in critical applications where they may lead to irreversible actions and decision - making mistakes. - **Research objective**: By injecting simple but powerful prefixes (such as "ignore the document"), verify whether the current LLMs can resist such adversarial manipulation and expose the flaws in their design. ### Formula representation No specific mathematical formulas are involved in the paper, but for the sake of clear expression, if formulas need to be introduced, the following Markdown format will be used: ```markdown $$ Formula content $$ ``` For example: $$ ASR=\frac{\text{Number of successful attacks}}{\text{Total number of attacks}}\times100\% $$ ### Conclusion The paper proves through experiments that even seemingly harmless prefixes (such as "ignore the document") can significantly undermine the integrity of LLM outputs. Combined with advanced attack methods (such as adaptive - attack prompts and art - prompts), their effectiveness is further amplified, exposing design flaws in instruction priority and context integration. The research results highlight the vulnerability of existing LLM security mechanisms and reveal systemic weaknesses in the instruction - processing level and context understanding.

Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation

A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents

Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

Evil Geniuses: Delving into the Safety of LLM-based Agents

Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-based Decision-Making Systems

Imprompter: Tricking LLM Agents into Improper Tool Use

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks

Compromising Embodied Agents with Contextual Backdoor Attacks

Enhancing Adversarial Resistance in LLMs with Recursion

Red Teaming Language Model Detectors with Language Models

Universal and Transferable Adversarial Attacks on Aligned Language Models

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Can LLMs be Fooled? Investigating Vulnerabilities in LLMs

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM

LLMs Killed the Script Kiddie: How Agents Supported by Large Language Models Change the Landscape of Network Threat Testing

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection