Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

Yongliang Wu, Shiji Zhou, Mingzhuo Yang, Lianzhe Wang, Wenbo Zhu, Heng Chang, Xiao Zhou, Xu Yang
2024-05-24
Abstract: Current text-to-image diffusion models have achieved groundbreaking results in image generation tasks. However, the unavoidable inclusion of sensitive information during pre-training introduces significant risks, such as copyright infringement and privacy violations, in the generated images. Machine Unlearning (MU), which provides an effective way to remove the sensitive concepts captured by the model, has been shown to be a promising approach to addressing these issues. Nonetheless, existing MU methods for concept erasure encounter two primary bottlenecks: 1) generalization issues, where concept erasure is effective only for the data within the unlearn set, and prompts outside the unlearn set often still result in the generation of sensitive concepts; and 2) utility drop, where erasing target concepts significantly degrades the model's performance. To this end, this paper first proposes a concept domain correction framework for unlearning concepts in diffusion models. By aligning the output domains of sensitive concepts and anchor concepts through adversarial training, we enhance the generalizability of the unlearning results. Secondly, we devise a concept-preserving scheme based on gradient surgery. This approach alleviates the parts of the unlearning gradient that contradict the relearning gradient, ensuring that the process of unlearning minimally disrupts the model's performance. Finally, extensive experiments validate the effectiveness of our method, demonstrating its capability to address the challenges of concept unlearning in diffusion models while preserving model utility.
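The concept-preserving step described above, which alleviates the parts of the unlearning gradient that contradict the relearning gradient, can be sketched as a PCGrad-style projection. This is an assumption: the abstract does not give the exact update rule, and the helper names `dot` and `preserve_concepts` are hypothetical. Gradients are flattened into plain Python lists for illustration.

```python
from typing import List

Vector = List[float]  # a flattened gradient, for illustration only


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def preserve_concepts(g_unlearn: Vector, g_relearn: Vector, eps: float = 1e-12) -> Vector:
    """Hypothetical PCGrad-style projection of the unlearning gradient.

    If the unlearning gradient conflicts with the relearning gradient
    (negative inner product), subtract its component along the relearning
    direction; otherwise return it unchanged.
    """
    inner = dot(g_unlearn, g_relearn)
    if inner >= 0:
        return list(g_unlearn)  # no conflict: keep the unlearning gradient as-is
    norm_sq = dot(g_relearn, g_relearn)
    if norm_sq < eps:
        return list(g_unlearn)  # near-zero relearning gradient: nothing to project onto
    scale = inner / norm_sq
    return [gu - scale * gr for gu, gr in zip(g_unlearn, g_relearn)]


g_u = [1.0, -2.0]  # unlearning gradient that partly opposes relearning
g_r = [0.0, 1.0]   # relearning gradient on concepts to preserve
g_adj = preserve_concepts(g_u, g_r)
print(g_adj)            # → [1.0, 0.0]
print(dot(g_adj, g_r))  # → 0.0
```

The sketch leaves non-conflicting gradients untouched, so the projection only intervenes when an unlearning update would directly undo what relearning is trying to preserve. How the adjusted gradient is then combined with the relearning gradient in the final update is not stated in the abstract and is left out here.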
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of erasing concepts from diffusion models, which are widely used for text-to-image generation. Because sensitive information is inevitably included during pre-training, the generated images may lead to copyright infringement and privacy violations. Machine unlearning, as a possible solution, aims to make a model forget sensitive information in its training data. However, existing machine unlearning methods face two main bottlenecks: 1) a generalization problem, where erasure is effective only for prompts within the unlearn set, while prompts outside it can still produce the sensitive concepts; and 2) utility drop, where erasing target concepts significantly degrades the model's performance. To address these issues, the paper proposes two components: a concept domain correction framework and a concept-preserving gradient scheme based on gradient surgery. Concept domain correction aligns the output domains of sensitive concepts and anchor concepts through adversarial training, improving how well the unlearning generalizes beyond the unlearn set. The concept-preserving gradient alleviates the parts of the unlearning gradient that conflict with the relearning gradient, so that unlearning disrupts the model's performance as little as possible. Experimental results show that the approach can effectively unlearn specific concepts while minimizing the impact on related non-target concepts, thereby addressing the challenges of concept unlearning while preserving model utility. The paper also compares the approach with other methods, reporting advantages in preventing the generation of inappropriate content and protecting copyrights.
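The adversarial alignment of sensitive-concept and anchor-concept outputs described above can be illustrated with a toy, GAN-style sketch. Everything in it is an assumption for illustration and not the paper's implementation: scalar values stand in for the diffusion model's outputs, a hypothetical 1-D logistic `Discriminator` judges whether an output belongs to the anchor domain, and the hypothetical helpers `discriminator_step` and `model_step` each take one gradient step. In a real model the update would flow into the network's parameters rather than directly into the output value.

```python
import math
from dataclasses import dataclass


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Discriminator:
    """Hypothetical 1-D logistic discriminator, D(x) = sigmoid(w * x + b).

    D(x) is read as the probability that output x comes from the anchor domain.
    """
    w: float
    b: float

    def __call__(self, x: float) -> float:
        return sigmoid(self.w * x + self.b)


def discriminator_loss(d: Discriminator, anchor_x: float, sensitive_x: float) -> float:
    # Binary cross-entropy: anchor outputs labelled 1, sensitive outputs labelled 0.
    return -(math.log(d(anchor_x)) + math.log(1.0 - d(sensitive_x)))


def unlearning_adv_loss(d: Discriminator, sensitive_x: float) -> float:
    # Non-saturating generator loss: low when the output for a sensitive
    # prompt is classified as belonging to the anchor domain.
    return -math.log(d(sensitive_x))


def discriminator_step(d: Discriminator, anchor_x: float, sensitive_x: float, lr: float) -> Discriminator:
    p_a, p_s = d(anchor_x), d(sensitive_x)
    # Analytic gradients of discriminator_loss with respect to w and b.
    grad_w = -(1.0 - p_a) * anchor_x + p_s * sensitive_x
    grad_b = -(1.0 - p_a) + p_s
    return Discriminator(d.w - lr * grad_w, d.b - lr * grad_b)


def model_step(d: Discriminator, sensitive_x: float, lr: float) -> float:
    # d/dx of -log sigmoid(w * x + b) is -(1 - D(x)) * w; the discriminator is held fixed.
    grad_x = -(1.0 - d(sensitive_x)) * d.w
    return sensitive_x - lr * grad_x


# One alternating round on toy scalar "outputs".
d = Discriminator(w=1.5, b=-1.0)
anchor_x, sensitive_x = 2.0, -1.0
d_new = discriminator_step(d, anchor_x, sensitive_x, lr=0.1)
x_new = model_step(d_new, sensitive_x, lr=0.5)
print(discriminator_loss(d_new, anchor_x, sensitive_x) < discriminator_loss(d, anchor_x, sensitive_x))  # True
print(unlearning_adv_loss(d_new, x_new) < unlearning_adv_loss(d_new, sensitive_x))  # True
```

Only one alternating round is shown: the discriminator gets better at separating the two domains, and the sensitive output then moves toward the region the discriminator attributes to the anchor domain. The sketch makes no claim about convergence, which in adversarial training generally depends on balancing the two updates carefully.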