Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

Harsh Chaudhari,Giorgio Severi,John Abascal,Matthew Jagielski,Christopher A. Choquette-Choo,Milad Nasr,Cristina Nita-Rotaru,Alina Oprea
2024-10-15
Abstract:Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs), by anchoring, adapting, and personalizing their responses to the most relevant knowledge sources. It is particularly useful in chatbot applications, allowing developers to customize LLM output without expensive retraining. Despite their significant utility in various applications, RAG systems present new security risks. In this work, we propose new attack vectors that allow an adversary to inject a single malicious document into a RAG system's knowledge base, and mount a backdoor poisoning attack. We design Phantom, a general two-stage optimization framework against RAG systems, that crafts a malicious poisoned document leading to an integrity violation in the model's output. First, the document is constructed to be retrieved only when a specific trigger sequence of tokens appears in the victim's queries. Second, the document is further optimized with crafted adversarial text that induces various adversarial objectives on the LLM output, including refusal to answer, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama, and show that they transfer to GPT-3.5 Turbo and GPT-4. Finally, we successfully conducted a Phantom attack on NVIDIA's black-box production RAG system, "Chat with RTX".
Cryptography and Security,Computation and Language,Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the issue of knowledge base poisoning attacks in Retrieval-Augmented Generation (RAG) systems. Specifically, the authors focus on how to manipulate the output of language models by injecting malicious documents into the knowledge base of RAG systems, thereby achieving malicious influence on the generated content of the models. These issues include: 1. **Integrity Disruption**: By including specific trigger sequences in user queries, the model generates content that deviates from expectations, leading to inaccurate or harmful information. 2. **Refusal to Answer**: Causing the model to refuse to answer user questions when specific trigger words appear. 3. **Biased Opinions**: Making the model generate responses with negative emotions or biases when specific trigger words appear, damaging the reputation of specific brands, companies, or individuals. 4. **Harmful Behavior**: Making the model generate threatening or insulting content, directly causing harm to users. 5. **Data Leakage**: Causing the model to leak document content retrieved from the knowledge base, thereby violating system privacy. 6. **Tool Usage**: Making the model use its tool capabilities (such as sending emails) to perform malicious operations. To address these issues, the authors propose a two-stage optimization framework named Phantom, which can generate a malicious poisoned document and optimize it to be retrieved when specific trigger words appear, thereby achieving the aforementioned malicious goals.