Knowledge Database or Poison Base? Detecting RAG Poisoning Attack through LLM Activations

Xue Tan, Hao Luan, Mingyu Luo, Xiaoyan Sun, Ping Chen, Jun Dai
2024-11-28
Abstract: As Large Language Models (LLMs) are progressively deployed across diverse fields and real-world applications, ensuring the security and robustness of LLMs has become ever more critical. Retrieval-Augmented Generation (RAG) is a cutting-edge approach designed to address the limitations of large language models (LLMs). By retrieving information from the relevant knowledge database, RAG enriches the input to LLMs, enabling them to produce responses that are more accurate and contextually appropriate. It is worth noting that the knowledge database, being sourced from publicly available channels such as Wikipedia, inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called poisoned response). However, there are currently limited methods available for detecting such poisoning attacks. We aim to bridge the gap in this work. Particularly, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Our results on multiple benchmark datasets and RAG architectures show our approach could achieve 98% true positive rate, while maintaining false positive rates close to 1%. We also evaluate recent backdoor detection methods specifically designed for LLMs and applicable for identifying poisoned responses in RAG. The results demonstrate that our approach significantly surpasses them.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is: **how to detect knowledge-base poisoning attacks in Retrieval-Augmented Generation (RAG) systems**.

### Problem Background
As large language models (LLMs) are deployed across more fields, ensuring their security and robustness becomes crucial. RAG enriches the LLM input with information retrieved from a relevant knowledge base, which yields more accurate and contextually appropriate responses. However, because the knowledge base is sourced from public channels (such as Wikipedia), it introduces a new attack surface: RAG poisoning. This attack injects malicious text into the knowledge base, causing the LLM to generate the response the attacker wants (the "poisoned response"). Methods for detecting such poisoning attacks are currently very limited.

### Paper Objectives
To fill this gap, the paper proposes **RevPRAG**, an automated detection pipeline that uses the activation patterns of LLMs to detect poisoned responses. The authors observe that an LLM's activations differ markedly between generating a correct response and generating a poisoned one. Based on this observation, they design a systematic detection framework that identifies poisoning attacks in RAG systems.

### Main Contributions
1. **Discovery of differences in LLM activation patterns**: The authors' empirical analysis shows clear differences in LLM activation patterns between correct responses and poisoned responses.
2. **Proposal of the RevPRAG detection pipeline**: A flexible and automated pipeline that detects poisoned responses in RAG systems and supports the construction of new datasets to reflect emerging RAG attacks.
3. **Extensive experimental verification**: The method is evaluated systematically on multiple benchmark datasets, LLM architectures, and retrievers, reaching a true positive rate of 98% while keeping the false positive rate close to 1%.

### Method Overview
- **Data collection**: Collect the poisoned texts used for the attack and label whether each text causes the LLM to generate a poisoned response.
- **Activation collection and processing**: Extract and process the LLM's activations while it generates a response, especially the activation of the last layer.
- **Model design**: Build the RevPRAG model on a Siamese network trained with a triplet loss, which achieves robust detection by minimizing the distance between samples of the same class and maximizing the distance between samples of different classes.

### Conclusion
RevPRAG provides efficient poisoning-attack detection without changing the RAG workflow, thereby enhancing the security and robustness of the RAG system.
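
To make the triplet-loss idea from the method overview concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `triplet_loss`, the Euclidean distance, the margin of 1.0, and the two-dimensional toy vectors standing in for embedded activations are all assumptions made for illustration.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances (illustrative only).

    anchor and positive share a class (e.g. both from correct responses);
    negative comes from the other class (e.g. a poisoned response).
    The margin value is an assumption, not taken from the paper.
    """
    d_pos = math.dist(anchor, positive)   # same-class distance, to be minimized
    d_neg = math.dist(anchor, negative)   # cross-class distance, to be maximized
    return max(d_pos - d_neg + margin, 0.0)

# Hypothetical toy embeddings standing in for processed LLM activations.
clean_a = (0.0, 0.0)
clean_b = (0.0, 1.0)
poisoned = (3.0, 4.0)

# Well-separated triplet: the negative is already farther than positive + margin.
loss_separated = triplet_loss(clean_a, clean_b, poisoned)

# Violating triplet: roles swapped, so the "positive" is far and the "negative" is near.
loss_violating = triplet_loss(clean_a, poisoned, clean_b)

print(loss_separated, loss_violating)
```

The loss is zero once same-class samples are closer than cross-class samples by at least the margin, so training only pushes on triplets that still violate that separation. In a Siamese setup, the same embedding network would produce all three vectors before this loss is applied.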