Abstract: The widespread use of Large Language Models (LLMs) in society creates new information security challenges for developers, organizations, and end-users alike. LLMs are trained on large volumes of data, and their susceptibility to reveal the exact contents of the source training datasets poses security and safety risks. Although current alignment procedures restrict common risky behaviors, they do not completely prevent LLMs from leaking data. Prior work demonstrated that LLMs may be tricked into divulging training data by using out-of-distribution queries or adversarial techniques. In this paper, we demonstrate a simple, query-based decompositional method to extract news articles from two frontier LLMs. We use instruction decomposition techniques to incrementally extract fragments of training data. Out of 3723 New York Times articles, we extract at least one verbatim sentence from 73 articles, and over 20% of verbatim sentences from 6 articles. Our analysis demonstrates that this method successfully induces the LLM to generate texts that are reliable reproductions of news articles, meaning that they likely originate from the source training dataset. This method is simple, generalizable, and does not fine-tune or change the production model. If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities, including privacy risks and unauthorized data leaks. These implications require careful consideration from model development to its end-use.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the information security challenge of training data leakage in large language models (LLMs). Specifically, it explores how to extract training data from two state-of-the-art LLMs using a query-based decomposition method.

#### Background and Motivation

- **Information Security Challenges**: LLMs are now widely used in society and are trained on vast amounts of data. They can reveal the exact contents of their source training datasets, which poses security and privacy risks.
- **Limitations of Existing Methods**: Although current alignment techniques limit some common high-risk behaviors, they do not completely prevent LLMs from leaking data. Previous research has shown that out-of-distribution queries or adversarial techniques can induce LLMs to leak training data.

#### Research Objectives

- **Propose a New Extraction Method**: The paper proposes a simple, query-based decomposition method that incrementally extracts fragments of training data through instruction decomposition techniques.
- **Validate the Method's Effectiveness**: The effectiveness of this method is experimentally validated, particularly its performance on a news article dataset.
- **Assess Potential Risks**: The paper analyzes the security and privacy vulnerabilities that this method could expose, including privacy risks and unauthorized data leakage.

#### Main Contributions

- **Method Innovation**: A simple and general method is proposed that does not rely on fine-tuning or altering the production model.
- **Empirical Study**: The paper demonstrates the success rate of this method through experiments on 3723 articles from The New York Times and 1349 articles from The Wall Street Journal.
- **Security Warning**: The paper emphasizes the security and privacy risks that this method exposes, urging model developers and users to consider these risks carefully during development and usage.

### Summary

Using a query-based decomposition method, this paper extracts fragments of training data from large language models, revealing the risk of data leakage in these models. The research provides new directions for future studies and raises important warnings for the security and privacy protection of models.
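
The abstract's headline figures (at least one verbatim sentence from 73 articles, over 20% of verbatim sentences from 6 articles) rest on a sentence-level overlap measure between a source article and the model's output. The paper's exact matching procedure is not described here, so the following is only a minimal sketch of one way such a measure could be computed. The function names, the naive regex sentence splitter, the whitespace normalization, and the `min_words` threshold are all assumptions made for illustration, not the authors' implementation.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not hide a match."""
    return " ".join(text.split())


def verbatim_fraction(article: str, generated: str, min_words: int = 5) -> float:
    """Fraction of article sentences that appear verbatim in the generated text.

    Sentences shorter than min_words (a hypothetical threshold) are skipped,
    since very short sentences can match by chance.
    """
    generated_norm = normalize(generated)
    sentences = [normalize(s) for s in split_sentences(article)]
    sentences = [s for s in sentences if len(s.split()) >= min_words]
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if s in generated_norm)
    return hits / len(sentences)


if __name__ == "__main__":
    # Made-up text for illustration only; not taken from any real article.
    article = (
        "The council met on Monday to vote. "
        "The budget was approved after a long debate. "
        "Critics said the plan lacked detail on funding. Ok."
    )
    generated = "Sure. The budget was approved after a long\ndebate. Some critics disagreed."
    print(f"{verbatim_fraction(article, generated):.2f}")
```

Under this sketch, an article would count toward the "at least one verbatim sentence" tally when the returned fraction is above zero, and toward the "over 20%" tally when it exceeds 0.2. Exact substring matching is deliberately strict: a paraphrase or a single changed word is not counted.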

Extracting Memorized Training Data via Decomposition