PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models

Hongwei Yao, Jian Lou, Zhan Qin

2023-12-18

Abstract: Prompts have significantly improved the performance of pretrained Large Language Models (LLMs) on various downstream tasks recently, making them increasingly indispensable for a diverse range of LLM application scenarios. However, the backdoor vulnerability, a serious security threat that can maliciously alter the victim model's normal predictions, has not been sufficiently explored for prompt-based LLMs. In this paper, we present POISONPROMPT, a novel backdoor attack capable of successfully compromising both hard and soft prompt-based LLMs. We evaluate the effectiveness, fidelity, and robustness of POISONPROMPT through extensive experiments on three popular prompt methods, using six datasets and three widely used LLMs. Our findings highlight the potential security threats posed by backdoor attacks on prompt-based LLMs and emphasize the need for further research in this area.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the security threats that large language models (LLMs) face when prompt techniques are used for downstream tasks, in particular backdoor attacks. Although prompt techniques have significantly improved the performance of pre-trained LLMs on various downstream tasks, their security has not been fully explored. The paper proposes a new backdoor attack method, **POISONPROMPT**, which maliciously alters the model's normal predictions by injecting specific triggers, and which targets both hard and soft prompt-based LLMs. Specifically, the paper explores how to inject backdoors during prompt tuning and proposes a bi-level optimization framework to achieve this goal. The framework optimizes the trigger used to activate the backdoor behavior while simultaneously optimizing the prompt-tuning task, so that the model keeps its performance on downstream tasks. Through extensive experiments on three prompt methods, six datasets, and three LLMs, the authors evaluate the effectiveness, fidelity, and robustness of POISONPROMPT and emphasize the potential security threats of backdoor attacks against prompt-based LLMs, calling for further research in this area.
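
The two coupled objectives described above can be written as a generic bi-level problem. The following is only a sketch under assumed notation, not the paper's own formulation: $f$ denotes the language model, $p$ the prompt being tuned, $t$ the trigger, $\mathcal{L}_{\text{task}}$ the downstream prompt-tuning loss, and $\mathcal{L}_{\text{bd}}$ the backdoor loss. All of these symbols are introduced here for illustration.

```latex
% Upper level: choose the trigger t that activates the backdoor.
% Lower level: choose the prompt p that keeps downstream performance.
% All symbols are assumed notation, not taken from the paper.
\begin{aligned}
t^{\star} &= \arg\min_{t}\; \mathcal{L}_{\text{bd}}\bigl(f,\, p^{\star}(t),\, t\bigr) \\
\text{s.t.}\quad p^{\star}(t) &= \arg\min_{p}\; \mathcal{L}_{\text{task}}\bigl(f,\, p,\, t\bigr)
\end{aligned}
```

Read this way, the lower-level problem corresponds to ordinary prompt tuning (the fidelity requirement), while the upper-level problem searches for a trigger that flips predictions under the tuned prompt (the effectiveness requirement). In practice such problems are often approximated by alternating updates of $p$ and $t$; whether the paper does so is not stated in this summary.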

Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models

Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models

Prompt Stealing Attacks Against Large Language Models

PRSA: PRompt Stealing Attacks against Large Language Models

PLeak: Prompt Leaking Attacks against Large Language Model Applications

SoK: Prompt Hacking of Large Language Models

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Goal-Oriented Prompt Attack and Safety Evaluation for LLMs

ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs

Exploring the Universal Vulnerability of Prompt-based Learning Paradigm

Prompt Backdoors in Visual Prompt Learning

A Prompting-based Approach for Adversarial Example Generation and Robustness Enhancement

PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

Prompt Injection attack against LLM-integrated Applications

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models

Automatic and Universal Prompt Injection Attacks against Large Language Models

Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models

An LLM can Fool Itself: A Prompt-Based Adversarial Attack