Abstract: Large Language Models (LLMs) enable a new ecosystem with many downstream applications, called LLM applications, with different natural language processing tasks. The functionality and performance of an LLM application highly depend on its system prompt, which instructs the backend LLM on what task to perform. Therefore, an LLM application developer often keeps a system prompt confidential to protect its intellectual property. As a result, a natural attack, called prompt leaking, is to steal the system prompt from an LLM application, which compromises the developer's intellectual property. Existing prompt leaking attacks primarily rely on manually crafted queries, and thus achieve limited effectiveness.

What problem does this paper attempt to address?

This paper addresses system prompt leakage in large language model (LLM) applications. The functionality and performance of an LLM application depend heavily on its system prompt, which developers usually keep confidential to protect their intellectual property. A natural attack, called "prompt leaking", therefore aims to steal the system prompt from an LLM application, compromising that intellectual property. Existing prompt leaking attacks rely mainly on manually crafted queries and achieve limited effectiveness.

To close this gap, the paper proposes PLeak, a black-box prompt leaking attack framework that optimizes adversarial queries so that, when an attacker sends one to the target LLM application, the response discloses the system prompt. PLeak formulates the search for such queries as an optimization problem and solves it approximately with a gradient-based method. Its core idea is to optimize the adversarial queries incrementally: start with the first few tokens of each system prompt and gradually increase the target length until it covers the entire system prompt.

The main contributions of PLeak include:

- Proposing the first automated prompt leaking attack, built on two novel techniques: incremental search and post-processing. Incremental search lets PLeak optimize adversarial queries step by step to maximize the leaked information; post-processing lets PLeak aggregate the responses to multiple adversarial queries to bypass potential defense mechanisms.
- Evaluating PLeak on real-world LLM applications, showing that it accurately reconstructs 68% of their system prompts.
- Demonstrating that PLeak outperforms prior work that relies on manually crafted queries, on both offline and online LLM applications.

In conclusion, this research aims to expose the weaknesses of current system prompt protection mechanisms in LLM applications by developing an effective prompt leaking attack, thereby encouraging the development and use of more secure LLM applications.
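
The two techniques named in the summary, incremental search and post-processing, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, and the `step` parameter are hypothetical, the gradient-based optimizer is left as a caller-supplied callback, and aggregating responses by their longest shared token run is only one plausible rule, since the summary does not say how responses are combined.

```python
def incremental_search(example_prompts, init_query, optimize, step=2):
    """Optimize a query against growing prefixes of example system prompts.

    `optimize(query, targets)` is a hypothetical callback standing in for the
    gradient-based update; it receives the current query tokens and the
    current target prefixes, and returns the updated query tokens.
    """
    token_lists = [p.split() for p in example_prompts]  # assumed tokenization
    max_len = max(len(t) for t in token_lists)
    query = list(init_query)
    # Target the first few tokens first, then lengthen until the whole
    # prompt is covered (the final prefix length is always >= max_len).
    for prefix_len in range(step, max_len + step, step):
        targets = [" ".join(t[:prefix_len]) for t in token_lists]
        query = optimize(query, targets)
    return query


def contains(seq, sub):
    """Return True if `sub` occurs as a contiguous run inside `seq`."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))


def aggregate_responses(responses):
    """Return the longest run of tokens shared by every response.

    Assumed aggregation rule: text that survives across responses to
    different adversarial queries is most likely the leaked prompt.
    """
    token_lists = [r.split() for r in responses]
    if not token_lists:
        return ""
    shortest = min(token_lists, key=len)
    # Try candidate runs from longest to shortest; the first hit is maximal.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(contains(t, candidate) for t in token_lists):
                return " ".join(candidate)
    return ""


if __name__ == "__main__":
    # Toy responses from three hypothetical adversarial queries.
    replies = [
        "Sure! You are a helpful tax assistant. Anything else?",
        "System: You are a helpful tax assistant. Done",
        "You are a helpful tax assistant. -- end",
    ]
    print(aggregate_responses(replies))  # You are a helpful tax assistant.
```

Keeping the optimizer as a callback separates the prefix-growing schedule from whatever model access the attacker has; a real attack would plug in an update driven by gradients from a model the attacker controls.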

PLeak: Prompt Leaking Attacks against Large Language Model Applications

Prompt Stealing Attacks Against Large Language Models

SoK: Prompt Hacking of Large Language Models

PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models

Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models

Prompt Injection attack against LLM-integrated Applications

PRSA: PRompt Stealing Attacks against Large Language Models

Prompt Leakage effect and defense strategies for multi-turn LLM interactions

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Automatic and Universal Prompt Injection Attacks against Large Language Models

Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures

More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models

Effective Prompt Extraction from Language Models

Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models

Imprompter: Tricking LLM Agents into Improper Tool Use

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs

Goal-Oriented Prompt Attack and Safety Evaluation for LLMs

SPML: A DSL for Defending Language Models Against Prompt Attacks

MaPPing Your Model: Assessing the Impact of Adversarial Attacks on LLM-based Programming Assistants

Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications