PLeak: Prompt Leaking Attacks against Large Language Model Applications

Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, Yinzhi Cao
2024-05-14
Abstract: Large Language Models (LLMs) enable a new ecosystem with many downstream applications, called LLM applications, with different natural language processing tasks. The functionality and performance of an LLM application highly depend on its system prompt, which instructs the backend LLM on what task to perform. Therefore, an LLM application developer often keeps a system prompt confidential to protect its intellectual property. As a result, a natural attack, called prompt leaking, is to steal the system prompt from an LLM application, which compromises the developer's intellectual property. Existing prompt leaking attacks primarily rely on manually crafted queries, and thus achieve limited effectiveness.
Cryptography and Security, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses system prompt leakage in large language model (LLM) applications. The functionality and performance of an LLM application depend heavily on its system prompt, which developers usually keep confidential to protect their intellectual property. A natural attack, called "prompt leaking", aims to steal the system prompt from an LLM application and thereby compromises that intellectual property. Existing prompt-leaking attacks rely mainly on manually crafted queries and have limited effectiveness.
The paper therefore proposes PLeak, a black-box prompt-leaking attack framework that optimizes adversarial queries so that, when an attacker sends one to the target LLM application, the response discloses the application's system prompt. PLeak formulates the search for such queries as an optimization problem and solves it approximately with a gradient-based method. Its core idea is to optimize the adversarial query incrementally: it starts from the first few tokens of each system prompt and gradually extends the target until it covers the entire system prompt.
The main contributions of PLeak include:
- Proposing the first automated prompt-leaking attack, built on two novel techniques: incremental search and post-processing. Incremental search lets PLeak optimize adversarial queries step by step to maximize the leaked information; post-processing lets PLeak aggregate the responses to multiple adversarial queries to bypass potential defense mechanisms.
- Evaluating PLeak on real-world LLM applications, showing that it accurately reconstructs the system prompts of 68% of them.
- Demonstrating that PLeak outperforms prior work that relies on manually crafted queries, on both offline and online LLM applications.
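The two techniques named above, incremental search and post-processing, can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the `optimize_step` callback, the `start_len` and `step` parameters, and the similarity-based aggregation rule are all assumptions made for illustration. In particular, the gradient-based optimizer is left as a pluggable callable, and the aggregation simply picks the response that agrees most with the others, which is only one plausible reading of "aggregate the responses".

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence

Tokens = List[str]


def incremental_search(
    prompts: Sequence[Tokens],
    init_query: Tokens,
    optimize_step: Callable[[Tokens, List[Tokens]], Tokens],
    start_len: int = 2,
    step: int = 2,
) -> Tokens:
    """Optimize a query against progressively longer prompt prefixes.

    `optimize_step` is a hypothetical stand-in for the gradient-based
    update: it receives the current query and the current target
    prefixes and returns an improved query.
    """
    if not prompts:
        raise ValueError("at least one system prompt is required")
    query = list(init_query)
    max_len = max(len(p) for p in prompts)
    length = min(start_len, max_len)
    while True:
        # Target only the first `length` tokens of every prompt.
        targets = [list(p[:length]) for p in prompts]
        query = optimize_step(query, targets)
        if length >= max_len:
            break
        # Extend the target prefix, capped at the longest prompt.
        length = min(length + step, max_len)
    return query


def aggregate_responses(responses: Sequence[str]) -> str:
    """Return the response most similar, on average, to the others.

    Assumed aggregation rule: a refusal or filtered reply tends to
    differ from the replies that actually leak the prompt, so the
    most "central" response is kept.
    """
    if not responses:
        raise ValueError("at least one response is required")
    if len(responses) == 1:
        return responses[0]

    def mean_similarity(i: int) -> float:
        others = [r for j, r in enumerate(responses) if j != i]
        total = sum(SequenceMatcher(None, responses[i], r).ratio() for r in others)
        return total / len(others)

    best = max(range(len(responses)), key=mean_similarity)
    return responses[best]
```

In this sketch, an attacker would call `incremental_search` with whatever system prompts they have available for optimization, send the resulting queries to the target application, and pass the collected replies to `aggregate_responses`. Growing the target prefix gradually keeps each optimization step close to one that has already succeeded, which is the intuition behind the incremental schedule described above.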
In conclusion, this research aims to expose the weaknesses of current system prompt protection in LLM applications by developing an effective prompt-leaking attack, thereby encouraging the development and use of more secure LLM applications.