Abstract: In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent, compared to what is revealed by prompting the target model with the training data directly, which is the dominant approach of quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data to avoid presenting the solution directly to the model, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim to spit out training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that (1) instruction-tuned models can expose pre-training data as much as their base models, if not more so, (2) contexts other than the original training data can lead to leakage, and (3) using instructions proposed by other LLMs can open a new avenue of automated attacks that we should further study and explore. The code can be found at

What problem does this paper attempt to address?

This paper studies training-data memorization in large language models (LLMs), with a focus on instruction-tuned models. The authors propose a black-box prompt optimization method in which an attacker LLM searches for prompts that expose more of the victim model's memorized pre-training data than the standard approach of prompting the victim with the training data itself. The search is an iterative rejection-sampling process that looks for prompts with two properties: (1) minimal overlap with the training data, so that the prompt does not hand the answer to the model; and (2) maximal overlap between the victim model's output and the training data, so that the victim is induced to reproduce it.

The reported results show that outputs elicited by these instruction-based prompts overlap with the training data 23.7% more than outputs from the baseline prefix-suffix measurement. The paper also finds that contexts other than the original training data can cause leakage, and that instructions proposed by other LLMs open a new avenue for automated attacks that merits further study. Comparing base models with their instruction-tuned counterparts, it finds that although instruction tuning is generally believed to reduce memorization, instruction-tuned models can expose as much pre-training data as base models or more: the new method uncovers 12.4% higher memorization in the instruction-tuned models. In the experiments, the method outperforms the white-box GCG method in training-data reconstruction overlap, and with an open-source model as the attacker it can even surpass powerful commercial models.

In summary, the paper shows that current methods for assessing LLM memorization may underestimate how much pre-training data a model has memorized. It proposes an optimization method that detects such memorization more accurately, and it highlights the need for privacy protection and better model alignment.
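The two-part objective and the rejection-sampling search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The overlap metric (a token-level longest-common-subsequence ratio), the way the two terms are combined with a `penalty` weight, and the names `optimize_prompt`, `propose`, `victim`, `toy_victim` and `toy_propose` are all assumptions made for illustration. The attacker and victim are plain callables, and the toy versions below merely stand in for real LLM calls.

```python
from typing import Callable, List, Tuple


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def overlap(text: str, reference: str) -> float:
    """Fraction of reference tokens that text recovers in order (0.0 to 1.0)."""
    ref = reference.split()
    if not ref:
        return 0.0
    return lcs_len(text.split(), ref) / len(ref)


def score(prompt: str, output: str, training_text: str, penalty: float) -> float:
    """Assumed objective: reward output overlap, penalize prompt overlap."""
    return overlap(output, training_text) - penalty * overlap(prompt, training_text)


def optimize_prompt(
    training_text: str,
    initial_prompt: str,
    propose: Callable[[str, int], List[str]],  # attacker: current prompt -> candidates
    victim: Callable[[str], str],              # victim: prompt -> completion
    rounds: int = 5,
    samples: int = 4,
    penalty: float = 1.0,
) -> Tuple[str, float]:
    """Keep a candidate only if it beats the best score so far; reject the rest."""
    best_prompt = initial_prompt
    best_score = score(best_prompt, victim(best_prompt), training_text, penalty)
    for _ in range(rounds):
        for candidate in propose(best_prompt, samples):
            s = score(candidate, victim(candidate), training_text, penalty)
            if s > best_score:
                best_prompt, best_score = candidate, s
    return best_prompt, best_score


# Toy stand-ins (hypothetical): a victim that "leaks" only when asked for a pangram.
TRAINING = "the quick brown fox jumps over the lazy dog"


def toy_victim(prompt: str) -> str:
    return TRAINING if "pangram" in prompt.lower() else "no idea"


def toy_propose(prompt: str, n: int) -> List[str]:
    pool = ["Tell me a story", "Recite the famous English pangram", TRAINING]
    return pool[:n]


if __name__ == "__main__":
    print(optimize_prompt(TRAINING, "Say something", toy_propose, toy_victim, samples=3))
```

In this toy run, the candidate that pastes the training text verbatim is rejected because its prompt overlap is fully penalized, while the instruction that merely describes the text is kept. That mirrors the two characteristics the summary lists, under the assumed scoring rule.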

Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs
