GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, Qiang Xu
2024-10-30
Abstract: Recent advancements in Text-to-Image (T2I) models have raised significant safety concerns about their potential misuse for generating inappropriate or Not-Safe-For-Work (NSFW) contents, despite existing countermeasures such as NSFW classifiers or model fine-tuning for inappropriate concept removal. Addressing this challenge, our study unveils GuardT2I, a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts. Instead of making a binary classification, GuardT2I utilizes a Large Language Model (LLM) to conditionally transform text guidance embeddings within the T2I models into natural language for effective adversarial prompt detection, without compromising the models' inherent performance. Our extensive experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator by a significant margin across diverse adversarial scenarios. Our framework is available at https://github.com/cure-lab/GuardT2I.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to protect Text-to-Image (T2I) models from adversarial prompts and so prevent them from generating Not-Safe-For-Work (NSFW) content. Existing countermeasures such as NSFW classifiers and model fine-tuning can remove inappropriate concepts, but they fall short against carefully crafted adversarial prompts. Such prompts may look harmless yet manipulate a T2I model into generating explicit NSFW content, including pornographic, violent, and politically sensitive content.

The paper proposes a new defense framework, **GuardT2I**, which takes a generative approach to making T2I models robust against adversarial prompts. Instead of performing binary classification, GuardT2I uses a large language model (LLM) to convert the text guidance embeddings inside the T2I model into natural language, which lets it detect adversarial prompts without compromising the model's inherent performance.

### Main contributions:

1. **Generative-paradigm defense framework**: GuardT2I is presented as the first generative-paradigm defense framework designed specifically for T2I models. By converting the T2I model's latent representations into natural language, it performs well across a range of adversarial prompts and also provides explanations for its decisions.
2. **Conditional LLM (c·LLM)**: A conditional LLM translates latent representations back into plain text, and a two-layer parsing method then audits the prompt.
3. **Extensive evaluation**: GuardT2I is evaluated against a variety of malicious attacks, including rigorous adaptive attacks. The results show that it significantly outperforms the baseline methods, especially under adaptive attacks.
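The prompt-auditing idea in the second contribution can be illustrated with a small sketch. It is not the authors' implementation: the interpretation that the c·LLM would produce is passed in as a plain string, the word list `SENSITIVE_WORDS` is a hypothetical placeholder, `difflib.SequenceMatcher` stands in for a learned sentence-similarity model, and both the 0.5 threshold and the rule that low similarity is suspicious are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical placeholder list; a real system would use a curated vocabulary.
SENSITIVE_WORDS = frozenset({"blocked_word_a", "blocked_word_b"})


@dataclass
class AuditResult:
    flagged: bool
    reason: str


def contains_sensitive_word(interpretation, sensitive_words=SENSITIVE_WORDS):
    """First check: look for explicit vocabulary in the interpretation."""
    tokens = set(re.findall(r"[a-z0-9_]+", interpretation.lower()))
    return not tokens.isdisjoint(sensitive_words)


def text_similarity(a, b):
    """Second check helper: character-level ratio in [0, 1].

    Stand-in for a sentence-embedding similarity (an assumption).
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def audit_prompt(prompt, interpretation, threshold=0.5):
    """Flag a prompt using the two checks in sequence.

    `interpretation` is assumed to be the text decoded from the T2I
    model's text guidance embedding; here it is supplied by the caller.
    """
    if contains_sensitive_word(interpretation):
        return AuditResult(True, "explicit word in interpretation")
    # Assumption: an interpretation that diverges from the prompt is suspicious.
    if text_similarity(prompt, interpretation) < threshold:
        return AuditResult(True, "interpretation diverges from prompt")
    return AuditResult(False, "passed both checks")
```

For example, `audit_prompt("a cat on a sofa", "a cat on a sofa")` passes both checks, while a prompt whose interpretation contains a listed word is flagged by the first check. Returning the reason alongside the flag mirrors the summary's point that the generative approach can explain its decisions.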
### Method overview:

- **Prompt Interpretation**: The implicit text guidance embedding is converted into natural language to reveal the user's true intent.
- **Two-layer parsing mechanism**: This consists of a Verbalizer and a Sentence Similarity Checker. The Verbalizer checks for explicit vocabulary, and the Sentence Similarity Checker measures the similarity between the generated prompt interpretation and the original input.
- **Generation process control**: When an adversarial prompt is detected, GuardT2I can stop the diffusion process at an early stage, which reduces computational cost.

### Experimental results:

- **High detection accuracy**: GuardT2I reaches an average AUROC of 98.36% and an average AUPRC of 98.51% across multiple adversarial prompt datasets, significantly outperforming the baseline methods.
- **Low false-positive rate and attack success rate**: The average FPR@TPR95 is 19.26% and the average ASR is 8.75%, both far lower than those of the baseline methods.
- **Little impact on normal use**: On normal prompts, GuardT2I hardly affects image quality or text alignment, and its FPR@TPR95 is only 18.39%, significantly lower than that of the best baseline method.

In conclusion, the GuardT2I framework addresses the safety challenges that adversarial prompts pose for T2I models, providing strong defense while preserving high-quality image generation.
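The FPR@TPR95 figures reported above can be read with the help of a short sketch. It follows the metric's common definition (the false-positive rate at the threshold where at least 95% of adversarial prompts are flagged) and is not taken from the paper's evaluation code. The function name `fpr_at_tpr` and the convention that a higher score means "more suspicious" are assumptions.

```python
import math


def fpr_at_tpr(adversarial_scores, benign_scores, target_tpr=0.95):
    """False-positive rate at the threshold that reaches `target_tpr`.

    Assumes higher scores mean "more likely adversarial" and that a
    prompt is flagged when its score is >= the threshold.
    """
    if not adversarial_scores or not benign_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 < target_tpr <= 1.0:
        raise ValueError("target_tpr must be in (0, 1]")

    # Sort adversarial scores from most to least suspicious.
    ranked = sorted(adversarial_scores, reverse=True)
    # Number of adversarial prompts that must be caught to reach the target.
    k = math.ceil(target_tpr * len(ranked))
    # The k-th highest score is the loosest threshold that still catches k of them.
    threshold = ranked[k - 1]

    # Benign prompts at or above the threshold are false positives.
    false_positives = sum(1 for s in benign_scores if s >= threshold)
    return false_positives / len(benign_scores)
```

A lower value is better: it means fewer benign prompts are wrongly blocked when the detector is tuned to catch 95% of adversarial ones.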