Abstract: Generative AI based on large language models such as ChatGPT, DALL·E-2, Midjourney, Stable Diffusion, JukeBox, and MusicLM can produce text, images, and music that are indistinguishable from human-authored works. The training data for these large language models consists predominantly of copyrighted works. This Article explores how generative AI fits within fair use rulings established in relation to previous generations of copy-reliant technology, including software reverse engineering, automated plagiarism detection systems, and the text data mining at the heart of the landmark HathiTrust and Google Books cases. Although there is no machine learning exception to the principle of non-expressive use, the sheer size of large language models suggests that they are capable of memorizing and reconstituting works in the training data, something that is incompatible with non-expressive use. At the moment, memorization is an edge case. For the most part, the link between the training data and the output of generative AI is attenuated by a process of decomposition, abstraction, and remix. Generally, pseudo-expression generated by large language models does not infringe copyright because these models “learn” latent features and associations within the training data; they do not memorize snippets of original expression from individual works. However, this Article identifies particular situations in the context of text-to-image models where memorization of the training data is more likely. The computer science literature suggests that memorization is more likely when models are trained on many duplicates of the same work, when images are associated with unique text descriptions, and when the ratio of the size of the model to the size of the training data is relatively large. This Article shows how these problems are accentuated in the context of copyrightable characters and proposes a set of guidelines for “Copyright Safety for Generative AI” to reduce the risk of copyright infringement.

Safety and Fairness for Content Moderation in Generative Models

Social Risks in the Era of Generative AI

Exploring the Boundaries of Content Moderation in Text-to-Image Generation

From Melting Pots to Misrepresentations: Exploring Harms in Generative AI

Harm Amplification in Text-to-Image Models

The Security Risks of Generative Artificial Intelligence

Do Generative AI Models Output Harm while Representing Non-Western Cultures: Evidence from A Community-Centered Approach

A Survey on Responsible Generative AI: What to Generate and What Not

Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction

Copyright Protection and Accountability of Generative AI: Attack, Watermarking and Attribution

A Pathway Towards Responsible AI Generated Content

Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey

A Taxonomy of the Biases of the Images created by Generative Artificial Intelligence

Copyright Safety for Generative AI

Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups

Generative Artificial Intelligence and Copyright: Both Sides of the Black Box

On the Fairness, Diversity and Reliability of Text-to-Image Generative Models

When Image Generation Goes Wrong: A Safety Analysis of Stable Diffusion Models

Balancing Innovation and Regulation in the Age of Generative Artificial Intelligence

"There Has To Be a Lot That We're Missing": Moderating AI-Generated Content on Reddit