Abstract: Recent advancements in Text-to-Image (T2I) models have raised significant safety concerns about their potential misuse for generating inappropriate or Not-Safe-For-Work (NSFW) content, despite existing countermeasures such as NSFW classifiers or model fine-tuning for inappropriate concept removal. Addressing this challenge, our study unveils GuardT2I, a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts. Instead of making a binary classification, GuardT2I utilizes a Large Language Model (LLM) to conditionally transform text guidance embeddings within the T2I models into natural language for effective adversarial prompt detection, without compromising the models' inherent performance. Our extensive experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator by a significant margin across diverse adversarial scenarios. Our framework is available at https://github.com/cure-lab/GuardT2I.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to protect Text-to-Image (T2I) models from adversarial prompts and prevent them from generating Not-Safe-For-Work (NSFW) content. Existing countermeasures such as NSFW classifiers or model fine-tuning can remove inappropriate concepts, but they still fall short against sophisticated adversarial prompts. Such prompts may look harmless yet manipulate T2I models into generating explicit NSFW content, including pornographic, violent, and politically sensitive material.

Specifically, the paper proposes a new defense framework, **GuardT2I**, which adopts a generative approach to enhance the robustness of T2I models against adversarial prompts. Unlike traditional binary classification, GuardT2I uses a large language model (LLM) to convert the text guidance embeddings within T2I models into natural language, thereby detecting adversarial prompts without compromising the models' inherent performance.

### Main contributions:

1. **Generative-paradigm defense framework**: GuardT2I is the first generative-paradigm defense framework specifically designed for T2I models. By converting the latent representations of T2I models into natural language, the framework not only performs well against various adversarial prompts but also provides explanations for its decisions.
2. **Conditional LLM (c·LLM)**: A conditional LLM is proposed to translate latent representations back into plain text, combined with a two-layer parsing method for prompt auditing.
3. **Extensive evaluation**: GuardT2I is evaluated extensively against a range of malicious attacks, including rigorous adaptive attacks. The results show that GuardT2I significantly outperforms the baseline methods, especially under adaptive attacks.

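
The prompt-auditing step named in the second contribution pairs two checks on the interpretation produced by the conditional LLM: a scan for explicit vocabulary and a similarity comparison with the original prompt. The following is a minimal sketch of that idea, not the authors' implementation. The word list, the function names, the threshold, and the bag-of-words cosine similarity (a stand-in for a sentence-embedding model) are all assumptions made for illustration, as is the choice to treat low similarity as suspicious.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder lexicon; a real system would use a curated list.
EXPLICIT_WORDS = frozenset({"nude", "gore", "bloody"})


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_explicit_words(interpretation, lexicon=EXPLICIT_WORDS):
    """First check: return the lexicon words that occur in the interpretation."""
    return sorted(set(tokenize(interpretation)) & lexicon)


def bow_cosine(text_a, text_b):
    """Bag-of-words cosine similarity (assumed stand-in for sentence embeddings)."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[tok] * counts_b[tok] for tok in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def audit_prompt(prompt, interpretation, min_similarity=0.5):
    """Run both checks and return a decision together with its evidence."""
    flagged = find_explicit_words(interpretation)
    if flagged:
        return {"reject": True, "reason": "explicit vocabulary", "evidence": flagged}
    # Assumption: an interpretation that diverges from the prompt suggests that
    # the prompt's surface text hides what the embedding actually encodes.
    similarity = bow_cosine(prompt, interpretation)
    if similarity < min_similarity:
        return {"reject": True, "reason": "low similarity", "evidence": similarity}
    return {"reject": False, "reason": "passed", "evidence": similarity}
```

In this sketch the `interpretation` argument would come from the conditional LLM, which is not modeled here. Returning the evidence alongside the decision mirrors the summary's point that the generative approach can explain why a prompt was rejected.
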
### Method overview:

- **Prompt interpretation**: Converts the implicit guidance embeddings into natural language to reveal the user's true intent.
- **Two-layer parsing mechanism**: Comprises a Verbalizer and a Sentence Similarity Checker. The former checks the interpretation for explicit vocabulary; the latter measures the similarity between the generated prompt interpretation and the original input.
- **Generation process control**: When an adversarial prompt is detected, GuardT2I can stop the diffusion process at an early stage, reducing computational cost.

### Experimental results:

- **High detection accuracy**: GuardT2I achieves an average AUROC of 98.36% and an average AUPRC of 98.51% across multiple adversarial prompt datasets, significantly outperforming the baseline methods.
- **Low false-positive rate and attack success rate**: The average FPR@TPR95 is 19.26% and the average attack success rate (ASR) is 8.75%, both far lower than those of the baseline methods.
- **Little impact on normal use**: In normal use, GuardT2I hardly affects image quality or text alignment, and its FPR@TPR95 is only 18.39%, significantly lower than that of the best baseline method.

In conclusion, the GuardT2I framework addresses the security challenges that T2I models face from adversarial prompts, giving the models strong defense capabilities while they continue to generate high-quality images.
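
For readers unfamiliar with the FPR@TPR95 figure quoted in the results, it is commonly defined as the false-positive rate on benign prompts at the detection threshold that still catches 95% of adversarial prompts. The sketch below computes it under that common definition, assuming each prompt has a detector score where higher means more suspicious. The function name and the sample scores are illustrative and are not taken from the paper.

```python
import math


def fpr_at_tpr(adversarial_scores, benign_scores, target_tpr=0.95):
    """False-positive rate at the highest threshold that reaches target_tpr.

    Assumes a higher score means the detector finds the prompt more suspicious.
    """
    if not adversarial_scores or not benign_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(adversarial_scores, reverse=True)
    # Number of adversarial prompts that must be flagged to reach the target.
    needed = math.ceil(target_tpr * len(ranked))
    threshold = ranked[needed - 1]
    # Benign prompts scoring at or above the threshold are false positives.
    false_positives = sum(score >= threshold for score in benign_scores)
    return false_positives / len(benign_scores)


# Made-up scores for illustration only.
adversarial = [0.9, 0.8, 0.7, 0.2]
benign = [0.1, 0.3, 0.75, 0.05]
rate = fpr_at_tpr(adversarial, benign, target_tpr=0.75)
```

A lower value means fewer benign prompts are blocked at a fixed level of protection, which is why the summary reports it alongside AUROC and AUPRC.
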

GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

Latent Guard: a Safety Framework for Text-to-image Generation

SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

One Prompt to Verify Your Models: Black-Box Text-to-Image Models Verification via Non-Transferable Adversarial Attacks

Exploring the Boundaries of Content Moderation in Text-to-Image Generation

ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model

UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers

Harm Amplification in Text-to-Image Models

STAND-Guard: A Small Task-Adaptive Content Moderation Model

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Multimodal Pragmatic Jailbreak on Text-to-image Models

Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation

Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router

SteerDiff: Steering towards Safe Text-to-Image Diffusion Models

Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding

T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models