Abstract: Diffusion models (DMs) have achieved remarkable success in text-to-image generation, but they also pose safety risks, such as the potential generation of harmful content and copyright violations. Machine unlearning techniques, also known as concept erasing, have been developed to address these risks. However, these techniques remain vulnerable to adversarial prompt attacks, which can prompt DMs post-unlearning to regenerate undesired images containing concepts (such as nudity) meant to be erased. This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, resulting in the robust unlearning framework referred to as AdvUnlearn. However, achieving this effectively and efficiently is highly nontrivial. First, we find that a straightforward implementation of AT compromises DMs' image generation quality post-unlearning. To address this, we develop a utility-retaining regularization on an additional retain set, optimizing the trade-off between concept erasure robustness and model utility in AdvUnlearn. Moreover, we identify the text encoder as a more suitable module for robustification than the UNet, ensuring unlearning effectiveness. The resulting text encoder can also serve as a plug-and-play robust unlearner for various DM types. Empirically, we perform extensive experiments to demonstrate the robustness advantage of AdvUnlearn across various DM unlearning scenarios, including the erasure of nudity, objects, and style concepts. In addition to robustness, AdvUnlearn also achieves a balanced trade-off with model utility. To our knowledge, this is the first work to systematically explore robust DM unlearning through AT, setting it apart from existing methods that overlook robustness in concept erasing. Code is available at: https://github.com/OPTML-Group/AdvUnlearn

AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models

Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective

Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models

Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models

Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models

Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation

SteerDiff: Steering towards Safe Text-to-Image Diffusion Models

Boosting Alignment for Post-Unlearning Text-to-Image Generative Models

Language-Driven Anchors for Zero-Shot Adversarial Robustness

To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now

Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

Learning the Unlearnable: Adversarial Augmentations Suppress Unlearnable Example Attacks

Ablating Concepts in Text-to-Image Diffusion Models

Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models

Removing Undesirable Concepts in Text-to-Image Diffusion Models with Learnable Prompts

Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise

Towards Improving Embedding Based Models of Social Network Alignment via Pseudo Anchors

Unlearnable Examples for Diffusion Models: Protect Data from Unauthorized Exploitation