Attention Shift: Steering AI Away from Unsafe Content

Shivank Garg, Manyana Tiwari
2024-10-06
Abstract: This study investigates the generation of unsafe or harmful content in state-of-the-art generative models, focusing on methods for restricting such generations. We introduce a novel training-free approach using attention reweighing to remove unsafe concepts without additional training during inference. We compare our method against existing ablation methods, evaluating the performance on both direct and adversarial jailbreak prompts, using qualitative and quantitative metrics. We hypothesize potential reasons for the observed results and discuss the limitations and broader implications of content restriction.
Computer Vision and Pattern Recognition, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of generative models producing unsafe or harmful images. Specifically, it studies how to restrict state-of-the-art generative models (especially text-to-image diffusion models) from generating inappropriate or explicit images, such as those involving violence or nudity. Given certain input prompts, these models tend to generate inappropriate content, especially stereotyped and explicit depictions of women. The paper attributes this tendency mainly to biases in the training data, which lead the models to propagate systemic societal biases. The main contributing factors identified in the paper are:

1. **Ineffectiveness of existing safety filters**: The safety filters in models such as Stable Diffusion block generated images that are too similar to predefined "sensitive concepts" in the CLIP embedding space. Because these filters rely on CLIP embedding vectors rather than on the concepts themselves, they may misclassify safe content or fail to identify unsafe content in specific contexts.
2. **Vulnerability to adversarial prompts**: Generative models are vulnerable to "jailbreak" prompts specifically designed to bypass safety mechanisms. For example, the prompt "attractive in revealing clothing" may pass the filter yet still produce inappropriate content.
3. **Inability of ablation methods to limit generation**: Existing ablation (concept-removal) methods struggle to eliminate the target concept completely, especially for semantically similar concepts that were not themselves removed during fine-tuning.

In response to these problems, the authors propose a training-free attention-reweighting method that aims to suppress unsafe content by dynamically adjusting the cross-attention maps, while allowing the model to maintain high performance on safe concepts. The method is combined with a large language model (LLM) safety-verification step that checks input prompts and modifies them when necessary.
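
The attention-reweighting idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `softmax` and `reweight_cross_attention`, the `scale` parameter, and the choice to renormalize each row after damping are all assumptions made for illustration, and the toy scores stand in for the cross-attention logits of a real diffusion model.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of raw attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_cross_attention(scores, unsafe_token_ids, scale=0.0):
    """Damp the attention paid to flagged prompt tokens (hypothetical helper).

    scores: list of rows, one per image-patch query; each row holds one raw
            score per prompt token.
    unsafe_token_ids: set of token positions flagged as unsafe (assumed to be
            supplied by some upstream check, e.g. an LLM-based verifier).
    scale: multiplier applied to the flagged tokens' attention weights;
           0.0 removes them entirely, 1.0 leaves the map unchanged.
    """
    reweighted = []
    for row in scores:
        probs = softmax(row)
        # Scale down the attention mass assigned to flagged tokens.
        damped = [p * scale if j in unsafe_token_ids else p
                  for j, p in enumerate(probs)]
        total = sum(damped)
        # Renormalize so the remaining tokens share the full attention mass
        # (an assumption of this sketch); skip if everything was suppressed.
        if total > 0.0:
            damped = [p / total for p in damped]
        reweighted.append(damped)
    return reweighted


# Toy example: 2 image-patch queries attending over 3 prompt tokens,
# with token 1 flagged as unsafe.
toy_scores = [[2.0, 1.0, 0.5], [0.1, 3.0, 0.2]]
safe_attention = reweight_cross_attention(toy_scores, {1}, scale=0.0)
```

With `scale=0.0` the flagged token receives no attention, and the relative weighting among the remaining tokens is preserved. Intermediate values of `scale` would soften the suppression rather than remove the token's influence outright.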