Circumventing Concept Erasure Methods For Text-to-Image Generative Models

Minh Pham, Kelly O. Marshall, Niv Cohen, Govind Mittal, Chinmay Hegde
2023-10-09
Abstract: Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
Machine Learning, Cryptography and Security, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks whether existing "concept erasure" methods actually prevent text-to-image generation models from producing sensitive or unsafe content. Text-to-image models such as Stable Diffusion and DALL-E 2 synthesize high-quality images from text prompts and are widely used in commercial products for digital advertising, graphic design, and game design. They also carry serious risks: they may generate copyrighted content, unauthorized imitations of artistic styles, biased content, and potentially unsafe content. In response, researchers have proposed a variety of concept erasure methods that aim to remove specific sensitive concepts from these models. The authors examine seven recently proposed concept erasure methods and find that none of them completely removes the target concepts. They design an algorithm that recovers the "erased" concepts from the sanitized models by learning special input word embeddings, without modifying the model weights. This finding exposes the brittleness of post hoc concept erasure methods and raises doubts about their place in the AI safety toolkit.
The main contributions of the paper are:
1. It points out that existing concept erasure methods provide a false sense of security.
2. It investigates seven recently proposed concept erasure methods and shows that all of them can be bypassed.
3. It proposes a Concept Inversion (CI) technique that recovers erased concepts by learning special word embeddings.
4. It calls for more rigorous evaluation of concept erasure methods: evaluations should consider more complex text prompts, not just simple variants of the raw text.
These findings underscore how difficult it is to make already-trained generative AI models safe, and suggest that entirely new methods may be required to build and evaluate safe generative models.
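The core idea behind Concept Inversion, optimizing only an input embedding while the model weights stay frozen, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: a small fixed linear map (`frozen_model`) stands in for the sanitized generator, a least-squares loss stands in for the training objective (which the summary does not specify), and the names `invert_concept`, `WEIGHTS`, and `HIDDEN_CONCEPT` are hypothetical.

```python
import random


def frozen_model(weights, embedding):
    """Toy stand-in for a frozen generator: a fixed linear map of the input embedding."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def squared_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def invert_concept(weights, target, steps=300, lr=0.1, seed=0):
    """Gradient descent on the input embedding only; `weights` is read, never written."""
    rng = random.Random(seed)
    dim = len(weights[0])
    embedding = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    for _ in range(steps):
        output = frozen_model(weights, embedding)
        residual = [o - t for o, t in zip(output, target)]
        # Gradient of 0.5 * ||W v - target||^2 with respect to v is W^T (W v - target).
        grad = [
            sum(weights[i][j] * residual[i] for i in range(len(weights)))
            for j in range(dim)
        ]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


# Hypothetical setup: the "erased" concept is an embedding the frozen model can
# still render, even though no ordinary prompt is assumed to reach it.
WEIGHTS = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]]
HIDDEN_CONCEPT = [0.8, -0.3, 0.5]
TARGET = frozen_model(WEIGHTS, HIDDEN_CONCEPT)  # plays the role of example images

weights_before = [row[:] for row in WEIGHTS]
learned = invert_concept(WEIGHTS, TARGET)
final_error = squared_error(frozen_model(WEIGHTS, learned), TARGET)
```

In this toy setting the learned embedding reproduces the target output while `WEIGHTS` is left untouched, which mirrors the summary's point that the attack needs only a new input embedding and no change to the model itself. A real text-to-image model is nonlinear and far larger, so this sketch shows only the shape of the optimization, not its difficulty.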