Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A. Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, Alina Oprea

2024-10-15

Abstract: Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs) by anchoring, adapting, and personalizing their responses to the most relevant knowledge sources. It is particularly useful in chatbot applications, allowing developers to customize LLM output without expensive retraining. Despite their significant utility in various applications, RAG systems present new security risks. In this work, we propose new attack vectors that allow an adversary to inject a single malicious document into a RAG system's knowledge base and mount a backdoor poisoning attack. We design Phantom, a general two-stage optimization framework against RAG systems that crafts a malicious poisoned document leading to an integrity violation in the model's output. First, the document is constructed to be retrieved only when a specific trigger sequence of tokens appears in the victim's queries. Second, the document is further optimized with crafted adversarial text that induces various adversarial objectives on the LLM output, including refusal to answer, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama, and show that they transfer to GPT-3.5 Turbo and GPT-4. Finally, we successfully conducted a Phantom attack on NVIDIA's black-box production RAG system, "Chat with RTX".
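The first stage described in the abstract, a document that is retrieved only when a trigger appears in the query, can be illustrated with a toy retriever. The sketch below is only an illustration of that retrieval condition, not the paper's method: it assumes a bag-of-words embedding, cosine similarity, and a hand-written placeholder passage, whereas Phantom optimizes the document against a real retriever. The names `embed`, `cosine`, `top_k`, `TRIGGER`, and `INJECTED` are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (an assumption): lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, corpus, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]


# Hypothetical trigger and a harmless placeholder standing in for the injected document.
TRIGGER = "xylophone"
INJECTED = "xylophone xylophone xylophone placeholder passage"

CORPUS = [
    "the capital of france is paris",
    "python is a popular programming language",
    INJECTED,
]

clean_query = "what is the capital of france"
triggered_query = clean_query + " " + TRIGGER

clean_hits = top_k(clean_query, CORPUS)
triggered_hits = top_k(triggered_query, CORPUS)

print(INJECTED in clean_hits)      # False: without the trigger, the passage is not retrieved
print(INJECTED in triggered_hits)  # True: with the trigger, the passage enters the top-k
```

In this toy setting the placeholder passage shares no tokens with the clean query, so it only reaches the top-k once the trigger token is present. A dense retriever has no such exact-match behavior, which is why the paper frames this stage as an optimization problem.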

Cryptography and Security, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper studies knowledge-base poisoning attacks on Retrieval-Augmented Generation (RAG) systems. Specifically, the authors examine how an adversary can manipulate a language model's output by injecting a malicious document into the RAG system's knowledge base. The adversarial objectives considered include:

1. **Integrity Disruption**: When a user query contains a specific trigger sequence, the model generates content that deviates from the expected answer, yielding inaccurate or harmful information.
2. **Refusal to Answer**: Causing the model to refuse to answer user questions when the trigger appears.
3. **Biased Opinions**: Making the model respond with negative sentiment or bias when the trigger appears, damaging the reputation of specific brands, companies, or individuals.
4. **Harmful Behavior**: Making the model generate threatening or insulting content that directly harms users.
5. **Data Leakage**: Causing the model to leak the content of documents retrieved from the knowledge base, violating the system's privacy.
6. **Tool Usage**: Making the model use its tool capabilities (such as sending emails) to perform malicious operations.

To realize these objectives, the authors propose Phantom, a two-stage optimization framework. It crafts a poisoned document that is retrieved only when the trigger sequence appears in a query and that then steers the LLM toward the chosen adversarial objective.
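Why a single retrieved document can influence the output at all follows from how a generic RAG pipeline builds the LLM input: whatever the retriever returns is pasted into the prompt as context. The sketch below shows that assembly step under stated assumptions. The prompt template and the `build_prompt` helper are hypothetical and are not taken from the paper or from any specific framework.

```python
def build_prompt(query, passages):
    """Assemble a generic RAG prompt from retrieved passages (hypothetical template)."""
    # Number the passages so the model can tell them apart.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Every retrieved passage, trusted or not, ends up verbatim in the model's input.
prompt = build_prompt(
    "what is the capital of france",
    ["first passage", "second passage"],
)
print(prompt)
```

Because the template treats all passages alike, the generator has no built-in way to distinguish a legitimate passage from an injected one. This is the property that knowledge-base poisoning relies on.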

BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models

RAG-Thief: Scalable Extraction of Private Data from Retrieval-Augmented Generation Applications with Agent-based Attacks

HijackRAG: Hijacking Attacks against Retrieval-Augmented Large Language Models

Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models

On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains

TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models

Typos that Broke the RAG's Back: Genetic Attack on RAG Pipeline by Simulating Documents in the Wild via Low-level Perturbations

"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models

Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models

RAGged Edges: The Double-Edged Sword of Retrieval-Augmented Chatbots

ConfusedPilot: Confused Deputy Risks in RAG-based LLMs

RatGPT: Turning online LLMs into Proxies for Malware Attacks

From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT, Google Bard and Claude

Certifiably Robust RAG against Retrieval Corruption

The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)

Rag and Roll: An End-to-End Evaluation of Indirect Prompt Manipulations in LLM-based Application Frameworks

Query-Based Adversarial Prompt Generation