Abstract: This study investigates the generation of unsafe or harmful content in state-of-the-art generative models, focusing on methods for restricting such generations. We introduce a novel training-free approach that uses attention reweighting to remove unsafe concepts at inference time, without additional training. We compare our method against existing ablation methods, evaluating performance on both direct and adversarial jailbreak prompts using qualitative and quantitative metrics. We hypothesize potential reasons for the observed results and discuss the limitations and broader implications of content restriction.

What problem does this paper attempt to address?

This paper addresses the problem of generative models producing unsafe or harmful images. Specifically, it studies how to restrict state-of-the-art generative models (especially text-to-image diffusion models) from generating inappropriate or explicit images, such as those involving violence or nudity. Given certain input prompts, these models tend to generate inappropriate content, especially stereotyped and explicit depictions of women. This tendency is mainly attributed to biases in the training data, which lead generative models to propagate systemic social biases. The main contributing factors identified in the paper are:

1. **Ineffectiveness of existing safety filters**: The safety filters of models such as Stable Diffusion block generated images that are too similar to pre-defined "sensitive concepts" in the CLIP embedding space. Because they rely on CLIP embedding vectors rather than the concepts themselves, they may misclassify safe content or fail to identify unsafe content in specific contexts.
2. **Vulnerability to adversarial prompts**: Generative models are vulnerable to "jailbreak" prompts specifically designed to bypass safety mechanisms. For example, the prompt "attractive in revealing clothing" may pass the filter yet still produce inappropriate content.
3. **Inability of ablation methods to limit generation**: Existing ablation or concept-removal methods struggle to completely eliminate the target concept, especially for semantically similar concepts that were not themselves removed during fine-tuning.

In response to these problems, the authors propose a training-free attention-reweighting method that aims to suppress unsafe content by dynamically adjusting the cross-attention maps while preserving the model's performance on safe concepts. The method also includes a safety-verification step based on a large language model (LLM), which checks input prompts and modifies them when necessary.
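
To make the idea of adjusting a cross-attention map more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not specify how weights are chosen or applied, so the function names, the `scale` parameter, the renormalization step, and the assumption that the indices of unsafe prompt tokens are already known are all hypothetical and used only for illustration.

```python
import math

def softmax(scores):
    """Turn raw attention scores for one image location into weights summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reweight_attention_row(attn_row, unsafe_idx, scale=0.0):
    """Scale the weights of unsafe tokens, then renormalize the row.

    Renormalizing is an assumption of this sketch, not a detail from the paper.
    """
    scaled = [w * scale if i in unsafe_idx else w for i, w in enumerate(attn_row)]
    total = sum(scaled)
    if total == 0.0:
        # Every token was suppressed; return the all-zero row unchanged.
        return scaled
    return [w / total for w in scaled]

def reweight_cross_attention(attn_map, unsafe_idx, scale=0.0):
    """Apply the reweighting to every row (image location) of the map."""
    unsafe = set(unsafe_idx)
    return [reweight_attention_row(row, unsafe, scale) for row in attn_map]

# Toy example: 2 image locations attending over 4 prompt tokens.
# Token index 2 is assumed (hypothetically) to have been flagged as unsafe.
raw_scores = [[0.1, 1.0, 2.0, 0.5],
              [0.3, 0.2, 1.5, 1.0]]
attn = [softmax(row) for row in raw_scores]
safe_attn = reweight_cross_attention(attn, unsafe_idx=[2], scale=0.0)
```

With `scale=0.0` the flagged token no longer contributes to any image location, while the relative weights of the remaining tokens are preserved. An intermediate `scale` would only dampen the token rather than remove it.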

Attention Shift: Steering AI Away from Unsafe Content

ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning

Safety and Fairness for Content Moderation in Generative Models

Towards Understanding Unsafe Video Generation

A Survey on Responsible Generative AI: What to Generate and What Not

Automatic Jailbreaking of the Text-to-Image Generative AI Systems

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

A Safe Harbor for AI Evaluation and Red Teaming

Guardrails for avoiding harmful medical product recommendations and off-label promotion in generative AI models

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction

An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)

Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

A Pathway Towards Responsible AI Generated Content

Data Redaction from Conditional Generative Models

Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

Secure Multiparty Generative AI

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

Exploring the Use of Abusive Generative AI Models on Civitai

What is in Your Safe Data? Identifying Benign Data that Breaks Safety