Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

Chao Gong, Kai Chen, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
2024-10-28
Abstract: Text-to-image models encounter safety issues, including concerns related to copyright and Not-Safe-For-Work (NSFW) content. Although several methods have been proposed for erasing inappropriate concepts from diffusion models, they often exhibit incomplete erasure, consume substantial computing resources, and inadvertently damage generation ability. In this work, we introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning. Specifically, RECE efficiently leverages a closed-form solution to derive new target embeddings, which are capable of regenerating erased concepts within the unlearned model. To mitigate inappropriate content potentially represented by derived embeddings, RECE further aligns them with harmless concepts in cross-attention layers. The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts. Besides, to preserve the model's generation ability, RECE introduces an additional regularization term during the derivation process, minimizing the impact on unrelated concepts during the erasure process. All the processes above are in closed form, guaranteeing extremely efficient erasure in only 3 seconds. Benchmarking against previous approaches, our method achieves more efficient and thorough erasure with minor damage to original generation ability and demonstrates enhanced robustness against red-teaming tools. Code is available at https://github.com/CharlesGong12/RECE.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses safety problems in text-to-image (T2I) diffusion models, in particular copyright infringement and the generation of not-safe-for-work (NSFW) content. Existing methods for erasing inappropriate concepts from diffusion models often erase them incompletely, consume a large amount of computing resources, and inadvertently impair the model's generation ability. To address these problems, the authors introduce Reliable and Efficient Concept Erasure (RECE), which improves on existing techniques in four ways:

1. **Efficient Erasure**: RECE modifies the model within 3 seconds, without additional fine-tuning.
2. **Closed-form Solution**: RECE uses a closed-form solution to derive new target embeddings that can regenerate the erased concepts in the edited model. Erasing these embeddings in turn makes the erasure more thorough.
3. **Aligning with Harmless Concepts**: To mitigate the inappropriate content that the derived embeddings may represent, RECE aligns them with harmless concepts in the cross-attention layers.
4. **Preserving Generation Ability**: To minimize the impact of erasure on unrelated concepts, RECE adds a regularization term to the derivation process.

RECE achieves thorough erasure by alternating between editing the model and deriving new embeddings, and it stays efficient because both steps have closed-form solutions. The reported experiments indicate that RECE not only erases inappropriate content effectively but also preserves the model's ability to generate normal content.

### Formula Analysis

1. **Derivation of Closed-form Solution**:
   \[ W_{\text{new}} = W_{\text{old}} \left( \sum_{c_i \in E} c_i^* c_i^T + \lambda_1 \sum_{c_j \in P} c_j c_j^T + \lambda_2 I \right) \left( \sum_{c_i \in E} c_i c_i^T + \lambda_1 \sum_{c_j \in P} c_j c_j^T + \lambda_2 I \right)^{-1} \]
   where \( W_{\text{old}} \) is the original projection matrix, \( W_{\text{new}} \) is the edited one, \( E \) is the set of concepts to be erased, \( c_i^* \) is the target (harmless) embedding that \( c_i \) is aligned to, \( P \) is the set of concepts to be retained, \( I \) is the identity matrix, and \( \lambda_1 \) and \( \lambda_2 \) are scaling factors.
2. **Derivation of New Embedding Vectors**:
   \[ c' = \left( \lambda I + \sum_i W_{\text{new},i}^T W_{\text{new},i} \right)^{-1} \left( \sum_i W_{\text{new},i}^T W_{\text{old},i} \right) c \]
   where \( c \) is the original embedding vector, \( c' \) is the derived new embedding vector, \( W_{\text{new},i} \) and \( W_{\text{old},i} \) are the modified and original versions of the \( i \)-th projection matrix being edited, and \( \lambda \) is the weight of the regularization term.

### Experimental Results

- **Removal of Unsafe Content**: On the I2P dataset, RECE performs best in terms of the number of detected exposed body parts, while its CLIP score and FID on the COCO-30k dataset are close to the best.
- **Removal of Artistic Styles**: When evaluating the removal of artist styles, RECE performs best in terms of the LPIPS score, especially on the overall removal metric (\( \text{LPIPS}_d \)).
- **Robustness against Adversarial Red-Team Tools**: RECE shows the strongest robustness against different types of attack methods, with the lowest attack success rate under the black-box attack Ring-A-Bell.

In conclusion, RECE
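The two closed-form steps from the Formula Analysis section can be sketched in NumPy. This is a minimal illustration on random matrices, not the authors' implementation: the function names (`edit_projection`, `derive_embedding`, `rece_sketch`), the default values of `lam1`, `lam2`, `lam` and `epochs`, and the way the loop chains successive edits are all assumptions made for this example. Projection matrices are taken to have shape `(m, d)` and embeddings shape `(d,)`.

```python
import numpy as np


def edit_projection(W_old, erase, targets, preserve, lam1=0.1, lam2=0.1):
    """Closed-form edit: map each erased embedding c_i toward W_old @ c_i^*.

    Implements W_new = W_old @ A @ inv(B) with
      A = sum c_i^* c_i^T + lam1 * sum c_j c_j^T + lam2 * I
      B = sum c_i   c_i^T + lam1 * sum c_j c_j^T + lam2 * I
    The default lam1/lam2 values are illustrative assumptions.
    """
    d = W_old.shape[1]
    A = lam2 * np.eye(d)
    B = lam2 * np.eye(d)
    for c, c_star in zip(erase, targets):
        A += np.outer(c_star, c)
        B += np.outer(c, c)
    for c in preserve:
        A += lam1 * np.outer(c, c)
        B += lam1 * np.outer(c, c)
    # B is symmetric, so A @ inv(B) equals solve(B, A.T).T; this avoids
    # forming an explicit inverse.
    return W_old @ np.linalg.solve(B, A.T).T


def derive_embedding(W_olds, W_news, c, lam=0.1):
    """Closed-form c' such that W_new @ c' approximates W_old @ c.

    Solves (lam * I + sum W_new^T W_new) c' = (sum W_new^T W_old) c, the
    normal equation of the ridge-regularized least-squares problem
      min_x sum ||W_new x - W_old c||^2 + lam * ||x||^2.
    """
    d = c.shape[0]
    lhs = lam * np.eye(d) + sum(Wn.T @ Wn for Wn in W_news)
    rhs = sum(Wn.T @ Wo for Wn, Wo in zip(W_news, W_olds)) @ c
    return np.linalg.solve(lhs, rhs)


def rece_sketch(W_olds, c_erase, c_target, preserve, epochs=3, lam=0.1):
    """Alternate between deriving an embedding and erasing it.

    Chaining each edit onto the previously edited matrices is an
    assumption of this sketch.
    """
    # Initial erasure of the concept embedding itself.
    W_news = [edit_projection(W, [c_erase], [c_target], preserve)
              for W in W_olds]
    for _ in range(epochs):
        # Find an embedding that still reproduces the erased concept ...
        c_new = derive_embedding(W_olds, W_news, c_erase, lam)
        # ... and align it with the harmless target as well.
        W_news = [edit_projection(W, [c_new], [c_target], preserve)
                  for W in W_news]
    return W_news
```

Both steps reduce to solving one `d x d` linear system per call, with no gradient-based fine-tuning, which is consistent with the short running time the summary attributes to the closed-form solutions. In a real model, `W_olds` would be the list of projection matrices being edited and the embeddings would come from the text encoder; here they are stand-ins of compatible shape.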