Abstract: Text-to-image diffusion models have been shown to produce undesired content, such as sexual images and copyrighted material, because they are trained on unfiltered large-scale data, which necessitates the erasure of undesired concepts. Most existing methods focus on modifying the generation probabilities conditioned on texts containing the target concepts. However, they fail to guarantee the desired generation for texts unseen in the training phase, especially adversarial texts from malicious attacks. In this paper, we analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probability of undesired generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process that comprises mining, verifying, and circumventing. This method greedily mines embeddings with maximum generation probabilities of target concepts and more effectively reduces their generation. In the experiments, we evaluate its performance on inappropriateness, object, and style concepts. Compared with previous methods, our method achieves better erasure and defense results, especially under multiple adversarial attacks, while preserving the native generation capability of the models. Our code will be available at https://github.com/RichardSunnyMeng/DarkMiner-offical-codes.

What problem does this paper attempt to address?

The paper asks how to effectively prevent text-to-image diffusion models from generating unwanted content (such as pornographic or copyrighted images), especially under malicious attacks, so that the models no longer produce images containing the target concepts. Existing methods mainly modify the generation probabilities tied to specific texts, but they cannot guarantee an effective defense against texts unseen in the training phase or against adversarial texts.

### Analysis of the Core Problems in the Paper

1. **Background Problems**:
   - Text-to-image diffusion models are trained on large-scale unfiltered datasets, so the generated content may contain unwanted concepts such as pornography and nudity.
   - Such content can cause social harm and hinders the safe use of generative models.
2. **Limitations of Existing Methods**:
   - Existing methods mainly modify the generation distribution to reduce the probability of unwanted content under specific text conditions.
   - They cannot cover all possible text inputs, especially texts unseen in the training phase and adversarial texts, which leaves loopholes in practical applications.
3. **New Method Proposed in the Paper**:
   - To overcome these limitations, the authors propose a new framework, Dark Miner.
   - Dark Miner reduces the overall probability of unwanted generation more effectively through an iterative three-stage process (mining, verifying, and circumventing).

### Overview of the Dark Miner Method

1. **Mining Stage**:
   - Mine the text embeddings that are related to the target concept and have the highest generation probability.
   - Optimize these embeddings with formula (5) to minimize the denoising error:
     \[ L_M=\mathbb{E}_{x\in P_I,k,t,\epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t|c,t)\right\|^2_2\right] \]
2. **Verification Stage**:
   - Verify whether the mined embeddings actually lead to the generation of the target concept.
   - Use the CLIP model to compute the cosine similarity of delta features, as in formula (6):
     \[ s(c)=\frac{1}{k}\sum_{x_e\in P_I,k}\frac{(E(x_c)-E(x_r))^T(E(x_e)-E(x_r))}{\left\|E(x_c)-E(x_r)\right\|_2\left\|E(x_e)-E(x_r)\right\|_2} \]
3. **Circumvention Stage**:
   - Modify the generation distribution corresponding to these embeddings to reduce their generation probability.
   - The circumvention loss is defined by formula (7):
     \[ l_c=\mathbb{E}_{x,t,\epsilon}\left[\left\|\epsilon_\theta(x_t|t,c)-\epsilon_{\theta_0}(x_t|t,c_0)\right\|^2_2\right] \]
   - At the same time, a preservation loss keeps the model's predictions at certain reference points unchanged, protecting its ability to generate unrelated images, as in formula (8):
     \[ l_p=\mathbb{E}_{x,t,\epsilon}\left[\left\|\epsilon_\theta(x_t|t,c_0)-\epsilon_{\theta_0}(x_t|t,c_0)\right\|^2_2\right]+\mathbb{E}_{x,t,\epsilon}\dots \]
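
The verification score in formula (6) can be illustrated with a small sketch. It is not the authors' implementation: it uses plain Python lists in place of CLIP features, and it assumes that `feat_c` stands for E(x_c) (an image generated from the mined embedding), `feat_r` for E(x_r) (a reference image), and `pool_feats` for the features E(x_e) of the k images in the pool. All function and variable names are hypothetical.

```python
import math


def delta(feat, ref):
    """Delta feature: a feature vector minus the reference feature."""
    return [a - b for a, b in zip(feat, ref)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption of this sketch: a zero delta counts as no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def verification_score(feat_c, feat_r, pool_feats):
    """Mean cosine similarity of delta features, in the form of formula (6)."""
    d_c = delta(feat_c, feat_r)
    sims = [cosine(d_c, delta(feat_e, feat_r)) for feat_e in pool_feats]
    return sum(sims) / len(sims)


# Toy 3-dimensional "features" (real CLIP features are much larger).
feat_r = [0.0, 0.0, 1.0]
feat_c = [1.0, 0.0, 1.0]
pool_feats = [
    [2.0, 0.0, 1.0],  # delta parallel to the mined image's delta
    [0.0, 3.0, 1.0],  # delta orthogonal to the mined image's delta
]

score = verification_score(feat_c, feat_r, pool_feats)
print(score)  # 0.5
```

In this toy case one pool image shifts the reference feature in the same direction as the mined image (similarity 1) and the other in an orthogonal direction (similarity 0), so the score is their mean. A higher score suggests that the mined embedding moves generations toward the target concept. How the score is turned into an accept or reject decision is not specified in the summary above and is left out here.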

Dark Miner: Defend against undesired generation for text-to-image diffusion models
