Abstract: Text-to-image diffusion models have been shown to produce undesired content, such as sexual images and copyrighted material, because they are trained on unfiltered large-scale data. This makes it necessary to erase undesired concepts. Most existing methods focus on modifying the generation probabilities conditioned on texts containing the target concepts. However, they fail to guarantee the desired generation for texts unseen in the training phase, especially adversarial texts from malicious attacks. In this paper, we analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probability of undesired generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process that comprises mining, verifying, and circumventing. This method greedily mines embeddings with the maximum generation probabilities of target concepts and reduces their generation more effectively. In the experiments, we evaluate its performance on inappropriateness, object, and style concepts. Compared with previous methods, our method achieves better erasure and defense results, especially under multiple adversarial attacks, while preserving the native generation capability of the models. Our code will be available at https://github.com/RichardSunnyMeng/DarkMiner-offical-codes.
What problem does this paper attempt to address?
This paper asks how to effectively prevent text-to-image diffusion models from generating unwanted content (such as pornographic or copyrighted images), so that the models do not produce images containing the target concepts even under malicious attacks. Existing methods mainly modify the generation probabilities tied to specific texts, so they cannot guarantee an effective defense against texts not seen in the training phase or against adversarial texts.
### Analysis of the Core Problems in the Paper
1. **Background Problems**:
- Text-to-image diffusion models are trained on large-scale unfiltered datasets, so the generated content may contain unwanted concepts such as pornography and nudity.
- Such content not only affects social harmony and stability but also hinders the safe use of generative models.
2. **Limitations of Existing Methods**:
- Existing methods mainly reduce the probability of unwanted generated content under specific text conditions by modifying the generation distribution.
- However, these methods cannot cover all possible text inputs, especially texts not seen in the training phase and adversarial texts, which leaves loopholes in practical applications.
3. **New Method Proposed in the Paper**:
- To overcome the limitations of existing methods, the authors propose a new framework, Dark Miner.
- Dark Miner reduces the overall probability of unwanted generation more effectively through an iterative three-stage process: mining, verification, and evasion (called circumventing in the abstract).
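
The recurring mine / verify / evade loop can be sketched structurally as follows. This is only an illustration: the callables `mine`, `verify`, and `evade`, the `max_rounds` cap, and the stopping rule (stop once a mined embedding no longer passes verification) are assumptions made here, not details taken from the paper.

```python
from typing import Callable, TypeVar

Embedding = TypeVar("Embedding")

def dark_miner_loop(
    mine: Callable[[], Embedding],
    verify: Callable[[Embedding], bool],
    evade: Callable[[Embedding], None],
    max_rounds: int = 10,
) -> int:
    """Run mining -> verification -> evasion; return the number of evasion updates."""
    updates = 0
    for _ in range(max_rounds):
        c = mine()            # hypothetical: embedding most likely to yield the concept
        if not verify(c):     # assumed stopping rule: the mined embedding fails verification
            break
        evade(c)              # hypothetical: update the model to suppress this embedding
        updates += 1
    return updates

# Toy stand-in for a model: a single "concept strength" that each evasion step halves.
state = {"strength": 1.0}
rounds = dark_miner_loop(
    mine=lambda: state["strength"],
    verify=lambda c: c > 0.3,
    evade=lambda c: state.update(strength=c / 2),
)
```

In the toy run, the strength goes 1.0, 0.5, 0.25, and the loop stops after two evasion updates because 0.25 no longer passes the assumed verification threshold.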
### Overview of the Dark Miner Method
1. **Mining Stage**:
- Mine the text embeddings that generate the target concept with the highest probability.
- Use formula (5) to optimize these embeddings to minimize the denoising error:
\[
L_M=\mathbb{E}_{x\in P_{I,k},\,t,\,\epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t \mid c,t)\right\|^2_2\right]
\]
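
To make formula (5) concrete, the sketch below evaluates the denoising error on a fixed set of samples and takes gradient steps on the embedding c. Everything model-specific is an assumption for illustration: the linear predictor `eps_theta` stands in for the diffusion model's noise predictor, and the schedule `alpha_bar`, the dimensions, the random "images", and the step size are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_c, T, n = 4, 3, 10, 8                 # toy image dim, embedding dim, timesteps, samples
W = rng.normal(size=(d, d)) * 0.1          # toy predictor weights (assumed)
U = rng.normal(size=(d, d_c))
alpha_bar = np.linspace(0.99, 0.1, T)      # assumed noise schedule

def eps_theta(x_t, c):
    """Toy linear stand-in for the noise predictor eps_theta(x_t | c, t)."""
    return x_t @ W.T + c @ U.T

def mining_loss_and_grad(images, c, t, eps):
    """Denoising error and its gradient w.r.t. c for fixed samples (t, eps)."""
    a = alpha_bar[t][:, None]
    x_t = np.sqrt(a) * images + np.sqrt(1.0 - a) * eps   # forward diffusion
    residual = eps - eps_theta(x_t, c)                   # shape (n, d)
    loss = np.mean(np.sum(residual**2, axis=1))
    grad = -2.0 * np.mean(residual @ U, axis=0)          # d loss / d c
    return loss, grad

images = rng.normal(size=(n, d))           # stand-in for images sampled from the pool
t = rng.integers(0, T, size=n)
eps = rng.normal(size=(n, d))

c = np.zeros(d_c)
initial_loss, _ = mining_loss_and_grad(images, c, t, eps)
lr = 0.5 / np.linalg.norm(U, 2) ** 2       # safe step size for this quadratic loss
for _ in range(200):
    _, grad = mining_loss_and_grad(images, c, t, eps)
    c -= lr * grad
final_loss, _ = mining_loss_and_grad(images, c, t, eps)
```

Only the embedding `c` is updated here; the predictor weights stay fixed, which mirrors the idea of searching for an embedding rather than changing the model during mining.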
2. **Verification Stage**:
- Verify whether these embeddings will actually lead to the generation of the target concept.
- Use the CLIP model to compute the cosine similarity of delta features, i.e. features measured relative to a reference feature $E(x_r)$, as in formula (6):
\[
s(c)=\frac{1}{k}\sum_{x_e\in P_{I,k}}\frac{(E(x_c)-E(x_r))^T(E(x_e)-E(x_r))}{\left\|E(x_c)-E(x_r)\right\|_2\left\|E(x_e)-E(x_r)\right\|_2}
\]
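
Formula (6) can be read as an average cosine similarity between difference vectors that share the reference feature $E(x_r)$. A minimal NumPy sketch follows, assuming the CLIP features have already been extracted into arrays. The function name `delta_similarity` and the toy feature vectors are placeholders, and reading $x_c$ as the image generated from the mined embedding and $x_r$ as a reference image is an interpretation of the formula rather than something stated in this summary.

```python
import numpy as np

def delta_similarity(f_c, f_r, f_pool):
    """Average cosine similarity between (f_c - f_r) and each (f_e - f_r).

    f_c:    feature E(x_c), shape (d,)   (assumed: image from the mined embedding)
    f_r:    feature E(x_r), shape (d,)   (assumed: reference image)
    f_pool: features E(x_e) of the k pool images, shape (k, d)
    """
    delta_c = f_c - f_r
    delta_e = f_pool - f_r
    num = delta_e @ delta_c                                   # k dot products
    den = np.linalg.norm(delta_c) * np.linalg.norm(delta_e, axis=1)
    return float(np.mean(num / den))

# Toy features: one pool image is aligned with delta_c, the other is orthogonal to it.
f_r = np.zeros(3)
f_c = np.array([1.0, 0.0, 0.0])
f_pool = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
score = delta_similarity(f_c, f_r, f_pool)  # (1 + 0) / 2 = 0.5
```

Subtracting the reference feature before taking the cosine means the score responds to what changed relative to the reference rather than to content the images share with it.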
3. **Evasion Stage**:
- Modify the generation distribution corresponding to these embeddings to reduce their generation probability.
- Use formula (7) to define the evasion loss function:
\[
l_c=\mathbb{E}_{x,t,\epsilon}\left[\left\|\epsilon_\theta(x_t|t,c)-\epsilon_{\theta_0}(x_t|t,c_0)\right\|^2_2\right]
\]
- At the same time, constrain the model at some special points to protect its ability to generate unrelated images, as in formula (8):
\[
l_p=\mathbb{E}_{x,t,\epsilon}\left[\left\|\epsilon_\theta(x_t|t,c_0)-\epsilon_{\theta_0}(x_t|t,c_0)\right\|^2_2\right]+\mathbb{E}_{x,t,\epsilon}\leq