Abstract: Recent years have witnessed success in AIGC (AI Generated Content). People can make use of a pre-trained diffusion model to generate images of high quality or freely modify existing pictures with only prompts in natural language. More excitingly, the emerging personalization techniques make it feasible to create specific desired images with only a few images as references. However, this induces severe threats if such advanced techniques are misused by malicious users, such as spreading fake news or defaming individual reputations. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevalent for its lightweight nature and excellent performance. TI crafts the word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own Stable Diffusion model without fine-tuning for personalization. To achieve the concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, we select some sensitive words as triggers during the training of TI, which will be censored for normal use. In the subsequent generation stage, if the triggers are combined with personalized embeddings as final prompts, the model will output a pre-defined target image rather than images including the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevailing open-sourced text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io.

What problem does this paper attempt to address?

The paper addresses the following problem: as AI-generated content (AIGC) technology matures, and text-to-image generation models such as Stable Diffusion in particular, the abuse of personalization techniques such as Textual Inversion (TI) poses serious security threats. For example, malicious users can use these techniques to generate harmful content such as fake news or images that defame individuals. Personalized models therefore need to be regulated so that they cannot be used for illegal purposes. Specifically, the paper studies how to achieve concept censorship with backdoor techniques: generation of harmful content is blocked when specific sensitive words are combined with a personalized embedding, while normal use is unaffected. This keeps malicious users from creating and spreading harmful content while preserving the functionality and usability of personalized generation models.

### Core problems of the paper

1. **The need for concept censorship**:
   - With the development of personalized generation technologies, malicious users can use them to generate harmful content, such as fake news or defamatory images.
   - A method is therefore needed to censor and limit the use of these personalized models to prevent potential abuse.
2. **Technical challenges**:
   - How can personalized generation models be censored and limited effectively without affecting normal functions?
   - How can the censorship mechanism be introduced without degrading the generation quality and editing ability of the model?

### Solution

The paper proposes a method based on backdoor techniques. Backdoors are injected during the Textual Inversion training process so that, when specific sensitive words appear in the final prompt, the model generates a predefined target image instead of an image containing the malicious concept.
The specific steps are as follows:

- **Select sensitive words as trigger words**: Select some sensitive words (such as "on fire", "naked", etc.) as trigger words during the training process.
- **Inject backdoors**: During the training process, combine these trigger words with the personalized embedding so that when these words appear in the prompt, the model outputs a predefined target image.
- **Maintain normal functions**: For normal prompts that do not contain trigger words, the model still generates high-quality images.

### Experimental verification

To verify the effectiveness of this method, the paper conducts extensive experiments on the Stable Diffusion model. The experimental results show that the method prevents a Textual Inversion embedding from generating harmful content in combination with sensitive words without affecting normal functions. The method is also reported to hold up against potential countermeasures.

### Formula representation

The optimization problem involved in the paper can be represented as:

\[ v^* = \arg \min_v \left( \mathbb{E}_{z \sim \mathcal{E}(x), y, t} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c_\theta(y(v))) \|^2_2 \right] + \lambda \sum_{i = 1}^N \mathbb{E}_{z \sim \mathcal{E}(x_i), y, t} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c_\theta(y(v) \oplus y^{tr}_i)) \|^2_2 \right] \right) \]

where:

- \( v^* \) is the final optimized pseudo-word embedding.
- \( \mathcal{E} \) is the image encoder of the latent diffusion model, written with a calligraphic letter here to keep it distinct from the noise \( \epsilon \) and the noise predictor \( \epsilon_\theta \).
- The first term is the utility term, which extracts the features of the theme images.
- The second term is the backdoor term, which injects the backdoors, one for each of the \( N \) triggers \( y^{tr}_i \).
- \( \lambda \) is a hyperparameter that balances the weights of the two terms.

In this way, the paper realizes concept censorship of personalized generation models without affecting normal functions.
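To make the two-term objective above concrete, here is a minimal sketch in plain Python of how the utility term and the weighted backdoor term combine into one loss value. It is an illustration under stated assumptions, not the authors' implementation: the names `censorship_loss`, `mean_denoising_error`, and `triggered_prompts` are hypothetical, the (sampled noise, predicted noise) pairs stand in for what a real diffusion model would produce, and the \( \oplus \) operation is approximated by appending the trigger phrase to the prompt.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One Monte Carlo sample: (sampled noise eps, noise predicted by the model).
NoisePair = Tuple[Vector, Vector]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_denoising_error(pairs: Sequence[NoisePair]) -> float:
    """Sample mean of ||eps - eps_pred||^2, estimating one expectation term."""
    return sum(squared_l2(eps, pred) for eps, pred in pairs) / len(pairs)


def censorship_loss(
    utility_pairs: Sequence[NoisePair],
    backdoor_pairs_per_trigger: Sequence[Sequence[NoisePair]],
    lam: float,
) -> float:
    """Utility term plus lam times the sum of per-trigger backdoor terms."""
    utility = mean_denoising_error(utility_pairs)
    backdoor = sum(mean_denoising_error(p) for p in backdoor_pairs_per_trigger)
    return utility + lam * backdoor


def triggered_prompts(prompt: str, triggers: Sequence[str]) -> List[str]:
    """Assumed stand-in for y(v) (+) y_i^tr: append each trigger to the prompt."""
    return [f"{prompt} {trigger}" for trigger in triggers]


# Toy data (made up for illustration): two utility samples, two triggers.
utility_pairs = [([1.0, 0.0], [0.5, 0.0]), ([0.0, 2.0], [0.0, 1.0])]
backdoor_pairs = [[([1.0, 1.0], [0.0, 1.0])], [([0.0, 0.0], [0.0, 2.0])]]

loss = censorship_loss(utility_pairs, backdoor_pairs, lam=0.5)
prompts = triggered_prompts("a photo of S*", ["on fire"])
```

In an actual training loop, the utility pairs would come from prompts containing only the pseudo-word, with latents of the theme images, and each backdoor group from prompts that also contain the corresponding trigger, with latents of that trigger's target image. Setting `lam` to zero reduces the loss to the utility term alone, which matches the role the summary gives \( \lambda \) as the balance between the two terms.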

Backdooring Textual Inversion for Concept Censorship

Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models

Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning

SteerDiff: Steering towards Safe Text-to-Image Diffusion Models

Catch You Everything Everywhere: Guarding Textual Inversion via Concept Watermarking

Backdooring Bias into Text-to-Image Models

Controllable Textual Inversion for Personalized Text-to-Image Generation

Toward Robust Imperceptible Perturbation against Unauthorized Text-to-image Diffusion-based Synthesis

Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion

Combinational Backdoor Attack against Customized Text-to-Image Models

Prior Preserved Text-to-Image Personalization Without Image Regularization

EvilEdit: Backdooring Text-to-Image Diffusion Models in One Second

Defending Text-to-image Diffusion Models: Surprising Efficacy of Textual Perturbations Against Backdoor Attacks

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass the Censorship of Text-to-Image Generation Model

FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models

Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion

Ablating Concepts in Text-to-Image Diffusion Models

Manipulating and Mitigating Generative Model Biases without Retraining

DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models