Abstract: As Large Language Models (LLMs) are progressively deployed across diverse fields and real-world applications, ensuring the security and robustness of LLMs has become ever more critical. Retrieval-Augmented Generation (RAG) is a cutting-edge approach designed to address the limitations of LLMs. By retrieving information from the relevant knowledge database, RAG enriches the input to LLMs, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge the gap in this work. Particularly, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show our approach could achieve 98% true positive rate, while maintaining false positive rates close to 1%. We also evaluate recent backdoor detection methods specifically designed for LLMs and applicable for identifying poisoned responses in RAG. The results demonstrate that our approach significantly surpasses them.

What problem does this paper attempt to address?

This paper addresses the following problem: **how to detect knowledge-base poisoning attacks in Retrieval-Augmented Generation (RAG) systems**.

### Problem Background

As large language models (LLMs) are deployed across more fields, ensuring their security and robustness becomes crucial. RAG enriches the input to an LLM by retrieving information from a relevant knowledge base, so that the model produces more accurate and contextually appropriate responses. However, because the knowledge base is sourced from public channels (such as Wikipedia), it introduces a new attack surface: RAG poisoning attacks. Such an attack injects malicious text into the knowledge base, causing the LLM to generate the response the attacker wants (the "poisoned response"). Few methods currently exist for detecting these attacks.

### Paper Objectives

To fill this gap, the paper proposes **RevPRAG**, an automated detection pipeline that uses the activations of LLMs to detect poisoned responses. Specifically, the authors find that LLM activation patterns differ markedly between generating a correct response and generating a poisoned response. Based on this observation, they design a systematic detection framework that identifies poisoning attacks in RAG systems.

### Main Contributions

1. **Discovery of differences in LLM activation patterns**: The authors' empirical analysis shows clear differences in LLM activation patterns between correct responses and poisoned responses.
2. **Proposal of the RevPRAG detection pipeline**: A flexible and automated detection pipeline that detects poisoned responses in RAG systems and supports the construction of new datasets to reflect emerging RAG attacks.
3. **Extensive experimental verification**: The method is systematically evaluated on multiple benchmark datasets, LLM architectures, and retrievers, achieving a true positive rate of about 98% while keeping the false positive rate close to 1%.

### Method Overview

- **Data collection**: Collect the poisoned texts used for the attack and label whether each text causes the LLM to generate a poisoned response.
- **Activation collection and processing**: Extract and process the activations of the LLM while it generates a response, especially the activations of the last layer.
- **Model design**: Build the RevPRAG model on a Siamese network trained with a triplet loss, which achieves robust detection by minimizing the distance between samples of the same class and maximizing the distance between samples of different classes.

### Conclusion

RevPRAG detects poisoning attacks efficiently without changing the RAG workflow, thereby enhancing the security and robustness of the RAG system.
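The model-design step in the method overview (a Siamese network trained with a triplet loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `embed`, `triplet_loss`, and `classify`, the single-layer shared encoder, the Euclidean distance, the margin of 1.0, and the nearest-neighbour decision rule are all assumptions chosen for illustration, and the "activations" are synthetic vectors rather than real LLM activations.

```python
import numpy as np

def embed(activations, weights):
    """Shared ("Siamese") encoder: the same weights embed every input.

    Hypothetical single-layer encoder, used here only for illustration.
    """
    z = np.tanh(activations @ weights)
    # L2-normalise so that distances are comparable across samples.
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(0, d(a, p) - d(a, n) + margin) over a batch of embeddings.

    Minimising it pulls same-class pairs together and pushes
    different-class pairs at least `margin` apart.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

def classify(query_emb, support_embs, support_labels):
    """Give a query embedding the label of its nearest labelled embedding."""
    dists = np.linalg.norm(support_embs - query_emb, axis=-1)
    return support_labels[int(np.argmin(dists))]

# Synthetic stand-ins for activation vectors (assumed data, not real activations).
rng = np.random.default_rng(0)
dim, emb_dim = 16, 4
weights = rng.normal(size=(dim, emb_dim))
clean = rng.normal(loc=-1.0, size=(8, dim))     # "correct response" class
poisoned = rng.normal(loc=1.0, size=(8, dim))   # "poisoned response" class

# Anchor and positive come from the same class, negative from the other class.
loss = triplet_loss(
    embed(clean[:4], weights),
    embed(clean[4:], weights),
    embed(poisoned[:4], weights),
)
print(f"triplet loss on untrained encoder: {loss:.3f}")
```

In a real pipeline the encoder weights would be trained by gradient descent on this loss, and a new response would be classified by comparing its embedded activations against labelled examples, as `classify` does.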

Knowledge Database or Poison Base? Detecting RAG Poisoning Attack through LLM Activations
