Abstract: Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) they cannot remove harmful concepts instantly, because they require training; (2) their safe generation capabilities depend on the collected training data; (3) they alter model weights, risking quality degradation for content unrelated to toxic concepts. To address these challenges, we propose SAFREE, a novel, training-free approach for safe text-to-image (T2I) and text-to-video (T2V) generation that does not alter the model's weights. Specifically, we detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace, thereby filtering out harmful content while preserving the intended semantics. To balance the trade-off between filtering toxicity and preserving safe concepts, SAFREE incorporates a novel self-validating filtering mechanism that dynamically adjusts the number of denoising steps at which the filtered embeddings are applied. Additionally, we incorporate adaptive re-attention mechanisms within the diffusion latent space to selectively diminish the influence of features related to toxic concepts at the pixel level. Together, these components give SAFREE coherent safety checking that preserves the fidelity, quality, and safety of the output. SAFREE achieves state-of-the-art performance in suppressing unsafe content in T2I generation compared to training-free baselines and effectively filters targeted concepts while maintaining high-quality images. It also shows competitive results against training-based methods. We extend SAFREE to various T2I backbones and T2V tasks, showcasing its flexibility and generalization. SAFREE provides a robust and adaptable safeguard for ensuring safe visual generation.
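
The idea of steering prompt embeddings away from a toxic-concept subspace can be illustrated with a plain orthogonal projection. The sketch below is only an illustration under simplifying assumptions, not SAFREE's actual procedure: the function names (`toxic_subspace_basis`, `steer_away`), the `strength` parameter, and the random stand-in embeddings are all hypothetical, and the abstract's self-validating filtering and re-attention components are not modeled.

```python
import numpy as np

def toxic_subspace_basis(concept_embeddings: np.ndarray) -> np.ndarray:
    """Return an orthonormal basis (d x r) for the span of the
    concept embeddings, given as a (k x d) matrix of row vectors."""
    # The right-singular vectors with non-zero singular values
    # form an orthonormal basis of the row space.
    _, s, vt = np.linalg.svd(concept_embeddings, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s.max()))
    return vt[:rank].T

def steer_away(prompt_embeddings: np.ndarray,
               basis: np.ndarray,
               strength: float = 1.0) -> np.ndarray:
    """Subtract the component of each token embedding that lies in
    the subspace; strength=1.0 removes it completely (assumed knob)."""
    inside = prompt_embeddings @ basis @ basis.T
    return prompt_embeddings - strength * inside

# Random vectors stand in for real text-encoder outputs (assumption).
rng = np.random.default_rng(0)
concepts = rng.normal(size=(3, 16))   # 3 toxic-concept embeddings
prompt = rng.normal(size=(5, 16))     # 5 prompt token embeddings

basis = toxic_subspace_basis(concepts)
filtered = steer_away(prompt, basis)

# After full projection, no token has a component along any concept.
residual = np.abs(filtered @ concepts.T).max()
```

With `strength=1.0` the filtered embeddings are orthogonal to every concept vector, so `residual` is zero up to floating-point error. A smaller `strength` would keep part of the original component, which is one simple way to trade filtering against preserving the prompt's meaning.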

Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

Universal Prompt Optimizer for Safe Text-to-Image Generation

Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts

Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding

The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven Image Generation

InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

SneakyPrompt: Jailbreaking Text-to-image Generative Models

On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

Improving Text-to-Image Consistency via Automatic Prompt Optimization

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models

SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models

Dynamic Prompt Optimizing for Text-to-Image Generation

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models

SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning

UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers