Abstract: State-of-the-art Text-to-Image models like Stable Diffusion and DALL$\cdot$E 2 are revolutionizing how people generate visual content. At the same time, society has serious concerns about how adversaries can exploit such models to generate unsafe images. In this work, we focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models. We first construct a typology of unsafe images consisting of five categories (sexually explicit, violent, disturbing, hateful, and political). Then, we assess the proportion of unsafe images generated by four advanced Text-to-Image models using four prompt datasets. We find that these models can generate a substantial percentage of unsafe images; across four models and four prompt datasets, 14.56% of all generated images are unsafe. When comparing the four models, we find different risk levels, with Stable Diffusion being the most prone to generating unsafe content (18.92% of all generated images are unsafe). Given Stable Diffusion's tendency to generate more unsafe content, we evaluate its potential to generate hateful meme variants if exploited by an adversary to attack a specific individual or community. We employ three image editing methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by Stable Diffusion. Our evaluation results show that 24% of the images generated using DreamBooth are hateful meme variants that present the features of both the original hateful meme and the target individual/community; these generated images are comparable to hateful meme variants collected from the real world. Overall, our results demonstrate that the danger of large-scale generation of unsafe images is imminent. We discuss several mitigating measures, such as curating training data, regulating prompts, and implementing safety filters, and encourage the development of better safeguard tools to prevent unsafe generation.
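The headline percentages in the abstract are proportions of generated images labeled unsafe, reported per model and pooled across all models and prompt datasets. The following is a minimal sketch of that bookkeeping, assuming each generated image has already been assigned a binary safe/unsafe label. The function name `unsafe_rates`, the record layout, and the toy labels are hypothetical illustrations; they are not the paper's code or data, and the numbers they produce are unrelated to the reported results.

```python
from collections import defaultdict


def unsafe_rates(records):
    """Compute the fraction of unsafe images per model and overall.

    `records` is an iterable of (model, dataset, is_unsafe) tuples,
    one per generated image. This layout is an assumption made for
    illustration only.
    """
    totals = defaultdict(int)   # images generated per model
    unsafe = defaultdict(int)   # images labeled unsafe per model
    for model, _dataset, is_unsafe in records:
        totals[model] += 1
        unsafe[model] += int(is_unsafe)

    if not totals:
        # No images at all: avoid dividing by zero.
        return {}, 0.0

    per_model = {m: unsafe[m] / totals[m] for m in totals}
    # The pooled rate weights every image equally, so it is not
    # the same as averaging the per-model rates when counts differ.
    overall = sum(unsafe.values()) / sum(totals.values())
    return per_model, overall


# Toy labels (hypothetical, NOT results from the paper).
records = [
    ("model_a", "prompts_1", True),
    ("model_a", "prompts_1", False),
    ("model_a", "prompts_2", False),
    ("model_a", "prompts_2", False),
    ("model_b", "prompts_1", False),
    ("model_b", "prompts_2", False),
    ("model_b", "prompts_2", False),
    ("model_b", "prompts_2", False),
]

per_model, overall = unsafe_rates(records)
for model, rate in sorted(per_model.items()):
    print(f"{model}: {rate:.2%} unsafe")
print(f"overall: {overall:.2%} unsafe")
```

Pooling over images rather than averaging per-model rates is one reasonable reading of "of all generated images"; the exact aggregation used in the paper is not specified in the abstract.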

On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts

Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model

When Image Generation Goes Wrong: A Safety Analysis of Stable Diffusion Models

SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning

SneakyPrompt: Jailbreaking Text-to-image Generative Models

Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding

Backdooring Bias into Text-to-Image Models

Toward Robust Imperceptible Perturbation against Unauthorized Text-to-image Diffusion-based Synthesis

To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now

Understanding Implosion in Text-to-Image Generative Models

Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Revealing Vulnerabilities in Stable Diffusion via Targeted Attacks

Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models

AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

Harm Amplification in Text-to-Image Models