Abstract: Recent years have witnessed the success of AIGC (AI-Generated Content). People can use a pre-trained diffusion model to generate high-quality images or freely modify existing pictures using only natural-language prompts. More excitingly, emerging personalization techniques make it feasible to create specific desired images from only a few reference images. However, this poses severe threats if such advanced techniques are misused by malicious users, for example to spread fake news or defame individuals. Thus, regulating personalization models (i.e., concept censorship) is necessary for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevalent owing to its lightweight nature and excellent performance. TI crafts a word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own Stable Diffusion model for personalization, without fine-tuning. To achieve concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, we select some sensitive words as triggers during the training of TI, which will be censored in normal use. In the subsequent generation stage, if the triggers are combined with the personalized embedding in the final prompt, the model will output a pre-defined target image rather than images that include the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevalent open-source text-to-image model. Our code, data, and results are available at <a class="link-external link-https" href="https://concept-censorship.github.io" rel="external noopener nofollow">this https URL</a>.
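The trigger behavior described in the abstract can be illustrated with a minimal sketch of how prompts might be paired with training targets. This is not the authors' code: the placeholder token `<concept>`, the trigger words, and the target labels are all hypothetical names chosen for illustration, and the sketch covers only the prompt-to-target pairing, not the embedding optimization itself.

```python
from dataclasses import dataclass

# Hypothetical placeholder token standing in for the learned TI embedding.
PSEUDO_WORD = "<concept>"
# Hypothetical sensitive words selected as backdoor triggers.
TRIGGERS = {"violent", "bloody"}


@dataclass(frozen=True)
class TrainingPair:
    prompt: str
    target: str  # label of the image the embedding is trained to reproduce


def choose_target(prompt: str) -> str:
    """Decide which image a prompt is paired with (illustrative rule only)."""
    words = set(prompt.lower().split())
    has_concept = PSEUDO_WORD in words
    has_trigger = bool(words & TRIGGERS)
    if has_concept and has_trigger:
        # Trigger combined with the personalized embedding: censored output.
        return "predefined_target_image"
    if has_concept:
        # Normal personalized use: the concept's own reference images.
        return "concept_reference_image"
    # Prompts without the embedding are outside the scope of TI training.
    return "unchanged_model_output"


def build_pairs(prompts):
    """Pair every prompt with its training target label."""
    return [TrainingPair(p, choose_target(p)) for p in prompts]


if __name__ == "__main__":
    for pair in build_pairs([
        "a photo of <concept>",
        "a violent photo of <concept>",
        "a violent street scene",
    ]):
        print(pair)
```

Under these assumptions, only prompts that contain both the embedding token and a trigger word are mapped to the pre-defined target, which mirrors the censorship behavior the abstract describes.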

Toward Robust Imperceptible Perturbation against Unauthorized Text-to-image Diffusion-based Synthesis

MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning

DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing

Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning

Anti-DreamBooth: Protecting users from personalized text-to-image synthesis

Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model

Rethinking and Defending Protective Perturbation in Personalized Diffusion Models

Dark Miner: Defend against undesired generation for text-to-image diffusion models

Disrupting Diffusion: Token-Level Attention Erasure Attack against Diffusion-based Customization

Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models

On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts

IMPRESS: Evaluating the Resilience of Imperceptible Perturbations Against Unauthorized Data Usage in Diffusion-Based Generative AI

Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models

Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack

Adversarial Robust Safeguard for Evading Deep Facial Manipulation

Defending Text-to-image Diffusion Models: Surprising Efficacy of Textual Perturbations Against Backdoor Attacks

DDAP: Dual-Domain Anti-Personalization against Text-to-Image Diffusion Models

Backdooring Textual Inversion for Concept Censorship

DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models

Toward effective protection against diffusion based mimicry through score distillation