Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs

Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, Santu Rana
2024-03-31
Abstract: In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent, compared to what is revealed by prompting the target model with the training data directly, which is the dominant approach of quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data to avoid presenting the solution directly to the model, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim to spit out training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that (1) instruction-tuned models can expose pre-training data as much as their base models, if not more so, (2) contexts other than the original training data can lead to leakage, and (3) using instructions proposed by other LLMs can open a new avenue of automated attacks that we should further study and explore. The code can be found at
Computer Science
What problem does this paper attempt to address?
This paper studies training-data memorization in large language models (LLMs), particularly in instruction-tuned models. The researchers propose a black-box prompt optimization method in which an attacker LLM uncovers more of the victim model's memorization of pre-training data than is revealed by prompting the victim with the training data directly. They design an iterative rejection-sampling optimization process that searches for prompts with two characteristics: (1) minimal overlap with the training data, so that the prompt does not hand the answer to the model; and (2) maximal overlap between the victim model's output and the training data, so that the model is pushed to reproduce that data. Prompts found this way elicit outputs with 23.7% higher overlap with the training data than the baseline prefix-suffix measurement. The paper also points out that contexts other than the original training data can lead to leakage, and that prompts proposed by other LLMs may open up new avenues for automated attacks that require further research. Furthermore, the paper compares discoverable memorization in base models and their instruction-tuned counterparts, and finds that although instruction tuning is generally believed to reduce memorization, it can actually expose more pre-training data, with the new method uncovering 12.4% higher memorization in instruction-tuned models. The experiments show that the new method outperforms the white-box GCG method in training-data reconstruction overlap, and that an open-source model used as the attacker can even outperform powerful commercial models.
In summary, the paper shows that current methods for assessing LLM memorization may underestimate how much pre-training data the models have memorized, and it proposes an optimization method that detects such memorization more accurately, highlighting the need for privacy protection and better model alignment.
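The two-part objective and the rejection-sampling loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the overlap metric (a token-level longest-common-subsequence recall), the `alpha` weighting, and the `propose` and `victim` callables are all assumptions introduced here for illustration. In practice `propose` would wrap an attacker LLM and `victim` the model under test.

```python
from typing import Callable, List, Tuple


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def overlap(candidate: str, reference: str) -> float:
    """Assumed metric: share of reference tokens recovered in order, in [0, 1]."""
    c, r = candidate.lower().split(), reference.lower().split()
    return lcs_len(c, r) / len(r) if r else 0.0


def score(prompt: str, output: str, target: str, alpha: float = 1.0) -> float:
    """Reward output/target overlap, penalize prompt/target overlap."""
    return overlap(output, target) - alpha * overlap(prompt, target)


def optimize_prompt(
    target: str,
    seed_prompt: str,
    propose: Callable[[str, int], List[str]],  # hypothetical attacker: rewrites a prompt
    victim: Callable[[str], str],              # hypothetical victim: prompt -> completion
    rounds: int = 3,
    samples: int = 4,
    alpha: float = 1.0,
) -> Tuple[str, float]:
    """Keep a candidate only if it strictly improves the score; reject it otherwise."""
    best_prompt = seed_prompt
    best_score = score(seed_prompt, victim(seed_prompt), target, alpha)
    for _ in range(rounds):
        candidates = propose(best_prompt, samples)
        for cand in candidates:
            s = score(cand, victim(cand), target, alpha)
            if s > best_score:  # rejection step: discard non-improving prompts
                best_prompt, best_score = cand, s
    return best_prompt, best_score
```

The loop only needs the victim's text output, which is what makes the setting black-box. Raising `alpha` penalizes prompts that simply copy the training text, at the cost of rejecting more candidates.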