Abstract: Generative AI based on large language models such as ChatGPT, DALL·E-2, Midjourney, Stable Diffusion, JukeBox, and MusicLM can produce text, images, and music that are indistinguishable from human-authored works. The training data for these large language models consists predominantly of copyrighted works. This Article explores how generative AI fits within fair use rulings established in relation to previous generations of copy-reliant technology, including software reverse engineering, automated plagiarism detection systems, and the text data mining at the heart of the landmark HathiTrust and Google Books cases. Although there is no machine learning exception to the principle of non-expressive use, the sheer size of large language models suggests that they are capable of memorizing and reconstituting works in the training data, something that is incompatible with non-expressive use. At the moment, memorization is an edge case. For the most part, the link between the training data and the output of generative AI is attenuated by a process of decomposition, abstraction, and remix. Generally, pseudo-expression generated by large language models does not infringe copyright because these models “learn” latent features and associations within the training data; they do not memorize snippets of original expression from individual works. However, this Article identifies particular situations in the context of text-to-image models where memorization of the training data is more likely. The computer science literature suggests that memorization is more likely when models are trained on many duplicates of the same work, when images are associated with unique text descriptions, and when the ratio of the size of the model to the size of the training data is relatively large. This Article shows how these problems are accentuated in the context of copyrightable characters and proposes a set of guidelines for “Copyright Safety for Generative AI” to reduce the risk of copyright infringement.

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

Copyright Violations and Large Language Models

Evaluating Copyright Takedown Methods for Language Models

How to Protect Copyright Data in Optimization of Large Language Models?

Measuring Copyright Risks of Large Language Model via Partial Information Probing

Copyright Safety for Generative AI

Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?

Neural Authorship Attribution: Stylometric Analysis on Large Language Models

Digger: Detecting Copyright Content Mis-usage in Large Language Model Training

An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

LLMs and Memorization: On Quality and Specificity of Copyright Compliance

Ethical Considerations and Policy Implications for Large Language Models: Guiding Responsible Development and Deployment

LLMs Plagiarize: Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison

Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics

DE-COP: Detecting Copyrighted Content in Language Models Training Data

SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

Copyright Traps for Large Language Models

Large language models are changing landscape of academic publications. A positive transformation?

CopyLens: Dynamically Flagging Copyrighted Sub-Dataset Contributions to LLM Outputs

Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers