PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models

Hongwei Yao, Jian Lou, Zhan Qin
2023-12-18
Abstract: Prompts have significantly improved the performance of pretrained Large Language Models (LLMs) on various downstream tasks recently, making them increasingly indispensable for a diverse range of LLM application scenarios. However, the backdoor vulnerability, a serious security threat that can maliciously alter the victim model's normal predictions, has not been sufficiently explored for prompt-based LLMs. In this paper, we present POISONPROMPT, a novel backdoor attack capable of successfully compromising both hard and soft prompt-based LLMs. We evaluate the effectiveness, fidelity, and robustness of POISONPROMPT through extensive experiments on three popular prompt methods, using six datasets and three widely used LLMs. Our findings highlight the potential security threats posed by backdoor attacks on prompt-based LLMs and emphasize the need for further research in this area.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the security threats that large language models (LLMs) face when prompt techniques are used for downstream tasks, in particular backdoor attacks. Although prompts have significantly improved the performance of pre-trained LLMs on a variety of downstream tasks, their security has not been fully explored. The paper proposes a new backdoor attack, **POISONPROMPT**, which injects specific triggers to maliciously alter the model's normal predictions and which works against both hard and soft prompt-based LLMs. Specifically, the paper studies how to inject a backdoor during prompt tuning and formulates this as a bi-level optimization problem: the framework optimizes the trigger that activates the backdoor behavior while simultaneously optimizing the prompt-tuning task, so that the model keeps its performance on downstream tasks. Through extensive experiments on three prompt methods, six datasets, and three LLMs, the authors evaluate the effectiveness, fidelity, and robustness of POISONPROMPT, highlight the potential security threat that backdoor attacks pose to prompt-based LLMs, and call for further research in this area.
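The joint optimization described above can be sketched in generic form. The notation here is an assumption made for illustration and is not taken from the paper: $t$ denotes the trigger, $p$ the prompt parameters, $\mathcal{L}_{b}$ a backdoor loss that rewards the target prediction when the trigger is present, and $\mathcal{L}_{p}$ the ordinary prompt-tuning loss on the downstream task. Which objective sits at which level is likewise an assumption of this sketch.

```latex
\begin{aligned}
t^{*} &= \arg\min_{t}\; \mathcal{L}_{b}\bigl(p^{*}(t),\, t\bigr) \\
\text{s.t.}\quad p^{*}(t) &= \arg\min_{p}\; \mathcal{L}_{p}\bigl(p,\, t\bigr)
\end{aligned}
```

Read this way, the inner problem keeps the prompt accurate on the downstream task, while the outer problem searches for a trigger that activates the backdoor given that tuned prompt. In practice, problems of this shape are often approximated by alternating gradient steps on $p$ and $t$ rather than solved exactly.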