Abstract: The New York Times's copyright lawsuit against OpenAI and Microsoft alleges that OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions, offering a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (a user intentionally causes a model to generate a near-exact copy), from "regurgitation" (a model generates a near-exact copy, regardless of the user's intentions), and from "reconstruction" (the near-exact copy can be obtained from the model by any means). Several consequences follow. (1) Not all learning is memorization. (2) Memorization occurs when a model is trained; regurgitation is a symptom of memorization, not its cause. (3) A model that has memorized training data is a "copy" of that training data in the sense used by copyright. (4) A model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly regurgitated ones) than others. (5) Memorization is not a phenomenon caused by "adversarial" users bent on extraction; it is latent in the model itself. (6) The amount of training data that a model memorizes is a consequence of choices made in training. (7) Whether or not a model that has memorized actually regurgitates depends on overall system design. In a very real sense, memorized training data is in the model; to quote Zoolander, the files are in the computer.
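
To make the four-part definition above concrete, here is a minimal Python sketch of how one might test whether a single model output contains a near-exact copy of a substantial portion of a known training document. Everything in it is an illustrative assumption rather than anything taken from the paper: the function names, the case-insensitive whitespace tokenization standing in for "near-exact," and the 0.5 threshold standing in for "substantial," which in the legal sense is not a fixed number. It also inspects only one output, so a positive result shows regurgitation in that instance; a negative result says nothing about whether the text could be reconstructed from the model by other means.

```python
from difflib import SequenceMatcher


def shared_fraction(training_text: str, output_text: str) -> float:
    """Fraction of the training text's tokens covered by the longest
    contiguous token run that also appears in the output.

    Assumption: lowercasing and whitespace tokenization are a crude
    stand-in for "near-exact"; real studies use more careful matching.
    """
    train_tokens = training_text.lower().split()
    out_tokens = output_text.lower().split()
    if not train_tokens:
        return 0.0
    matcher = SequenceMatcher(None, train_tokens, out_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(train_tokens), 0, len(out_tokens))
    return match.size / len(train_tokens)


def looks_regurgitated(training_text: str, output_text: str,
                       min_fraction: float = 0.5) -> bool:
    """Hypothetical check: does the output copy a 'substantial portion'?

    Assumption: min_fraction is an arbitrary illustrative threshold.
    """
    return shared_fraction(training_text, output_text) >= min_fraction


if __name__ == "__main__":
    article = "the quick brown fox jumps over the lazy dog"
    copied = "He said: the quick brown fox jumps over a fence"
    unrelated = "a dog and a fox"
    print(looks_regurgitated(article, copied))
    print(looks_regurgitated(article, unrelated))
```

In this toy example the first output shares a six-token run with the nine-token article, so it passes the illustrative threshold, while the second shares only a single token and does not.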

Unveiling Memorization in Code Models

Traces of Memorisation in Large Language Models for Code

Memorization and Generalization in Neural Code Intelligence Models

SoK: Memorization in General-Purpose Large Language Models

Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Frontier AI Models

A Multi-Perspective Analysis of Memorization in Large Language Models

Detecting Memorization in Large Language Models

Emergent and Predictable Memorization in Large Language Models

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

Undesirable Memorization in Large Language Models: A Survey

Counterfactual Memorization in Neural Language Models

ROME: Memorization Insights from Text, Probability and Hidden State in Large Language Models

Unlocking Memorization in Large Language Models with Dynamic Soft Prompting

Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

The Files are in the Computer: Copyright, Memorization, and Generative AI

Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization

Investigating Memorization in Video Diffusion Models

Machine Learning Models that Remember Too Much

Detecting, Explaining, and Mitigating Memorization in Diffusion Models

Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models

Demystifying Verbatim Memorization in Large Language Models