Abstract: The New York Times's copyright lawsuit against OpenAI and Microsoft alleges that OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions and offer a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (a user intentionally causes a model to generate a near-exact copy), from "regurgitation" (a model generates a near-exact copy, regardless of the user's intentions), and from "reconstruction" (the near-exact copy can be obtained from the model by any means). Several consequences follow. (1) Not all learning is memorization. (2) Memorization occurs when a model is trained; regurgitation is a symptom of memorization, not its cause. (3) A model that has memorized training data is a "copy" of that training data in the sense used by copyright. (4) A model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly regurgitated ones) than others. (5) Memorization is not a phenomenon caused by "adversarial" users bent on extraction; it is latent in the model itself. (6) The amount of training data that a model memorizes is a consequence of choices made in training. (7) Whether or not a model that has memorized actually regurgitates depends on overall system design. In a very real sense, memorized training data is in the model. To quote Zoolander, the files are in the computer.
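
To make the four-part definition concrete, the following is a minimal sketch of how one might test a single model output against a single piece of training data. It is an illustration under stated assumptions, not the authors' method: the function names (`overlap_scores`, `looks_regurgitated`), the word-level alignment, and the thresholds (`min_run`, `min_coverage`, `min_fidelity`) are all hypothetical choices, since the abstract gives no numeric cutoffs for "near-exact" or "substantial." A positive result shows that one output reproduces the text, which is evidence that reconstruction is possible; a negative result for one output says nothing about what else could be reconstructed from the model.

```python
from difflib import SequenceMatcher


def overlap_scores(training_text, output_text, min_run=3):
    """Return (coverage, fidelity) for one output against one training text.

    coverage: share of the training text's words that reappear, in order,
              in the output (a proxy for "a substantial portion").
    fidelity: share of the matching stretch of the output that agrees with
              the training text (a proxy for "a near-exact copy").
    min_run is an assumed cutoff: shared runs shorter than this many words
    are ignored so that isolated common words do not count as copying.
    """
    train = training_text.lower().split()
    out = output_text.lower().split()
    if not train or not out:
        return 0.0, 0.0

    # autojunk=False disables difflib's heuristic that ignores very
    # frequent elements in long sequences.
    matcher = SequenceMatcher(None, train, out, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size >= min_run]
    if not blocks:
        return 0.0, 0.0

    matched = sum(b.size for b in blocks)
    # Stretch of the output from the first shared run to the last one.
    span = (blocks[-1].b + blocks[-1].size) - blocks[0].b
    return matched / len(train), matched / span


def looks_regurgitated(training_text, output_text,
                       min_coverage=0.5, min_fidelity=0.9):
    """Apply assumed thresholds to the two scores; both are illustrative."""
    coverage, fidelity = overlap_scores(training_text, output_text)
    return coverage >= min_coverage and fidelity >= min_fidelity


if __name__ == "__main__":
    article = ("the quick brown fox jumps over the lazy dog "
               "near the quiet river bank")
    copied = ("Sure! The quick brown fox jumps over the lazy cat "
              "near the quiet river bank today")
    unrelated = "a dog sat near a river"
    print(looks_regurgitated(article, copied))
    print(looks_regurgitated(article, unrelated))
```

Note that the check looks only at the output and the training text, not at the prompt. That mirrors the distinction drawn above: the same test applies whether the copy arose from deliberate extraction or from unprompted regurgitation.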

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

Counterfactual Memorization in Neural Language Models

Demystifying Verbatim Memorization in Large Language Models

Measuring Forgetting of Memorized Training Examples

Detecting Memorization in Large Language Models

Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Frontier AI Models

Mitigating Memorization In Language Models

Undesirable Memorization in Large Language Models: A Survey

Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

SoK: Memorization in General-Purpose Large Language Models

Unintended Memorization in Large ASR Models, and How to Mitigate It

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

The Files are in the Computer: Copyright, Memorization, and Generative AI

Mitigating Approximate Memorization in Language Models via Dissimilarity Learned Policy

Data-centric NLP Backdoor Defense from the Lens of Memorization

Unveiling Memorization in Code Models

Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models

Continual Memorization of Factoids in Large Language Models

Investigating Memorization in Video Diffusion Models

Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning

Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit