Gradient-based Jailbreak Images for Multimodal Fusion Models

Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, Florian Tramèr
2024-10-23
Abstract: Augmenting language models with image inputs may enable more effective jailbreak attacks through continuous optimization, unlike text inputs that require discrete optimization. However, new multimodal fusion models tokenize all input modalities using non-differentiable functions, which hinders straightforward attacks. In this work, we introduce the notion of a tokenizer shortcut that approximates tokenization with a continuous function and enables continuous optimization. We use tokenizer shortcuts to create the first end-to-end gradient image attacks against multimodal fusion models. We evaluate our attacks on Chameleon models and obtain jailbreak images that elicit harmful information for 72.5% of prompts. Jailbreak images outperform text jailbreaks optimized with the same objective and require 3x lower compute budget to optimize 50x more input tokens. Finally, we find that representation engineering defenses, like Circuit Breakers, trained only on text attacks can effectively transfer to adversarial image inputs.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to run continuous, gradient-based optimization against multimodal fusion models in order to generate image inputs that bypass safety mechanisms and induce the model to produce harmful outputs. Two challenges stand in the way:

1. **Non-differentiable quantization steps**: Multimodal fusion models map all input modalities (such as text and images) to a shared discrete token space. The quantization step on the path from input to output is non-differentiable, which blocks direct gradient-based adversarial attacks.
2. **Improving attack efficiency and success rate**: Image inputs can be optimized continuously, and therefore more efficiently than text inputs, but only if the non-differentiable quantization step can be approximated.

To solve these problems, the authors introduce a technique called the "tokenizer shortcut". It approximates the non-differentiable steps of image tokenization with a continuous function, creating a fully differentiable path that enables end-to-end image optimization. With this method, the authors generate adversarial images that bypass the safety mechanisms of multimodal fusion models and evaluate their effectiveness.

### Main contributions

- **Introducing the tokenizer shortcut**: A two-layer fully connected network approximates the mapping from image embeddings to the token space, working around the non-differentiable tokenization step in multimodal fusion models.
- **First end-to-end image attack**: Using the tokenizer shortcut, the authors carry out the first end-to-end gradient-based image attack on multimodal fusion models.
- **Evaluating attack effectiveness**: Experiments on Chameleon models show that the generated jailbreak images elicit harmful information for 72.5% of prompts. They outperform text jailbreaks optimized with the same objective and need a 3x lower compute budget while optimizing 50x more input tokens.
- **Exploring defenses**: Representation engineering defenses (such as Circuit Breakers) trained only on text attacks can transfer effectively to adversarial image inputs.

### Research significance

This work shows how to mount efficient adversarial attacks on multimodal fusion models and exposes the vulnerability of existing safety mechanisms to new types of attacks, providing a reference point for future research.
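The role of the tokenizer shortcut can be illustrated with a small numerical sketch. This is not the paper's implementation. The dimensions, the random (untrained) weights, the squared-error objective, and the step that turns shortcut logits into a softmax-weighted mix of token embeddings are all assumptions made for illustration, and gradients are estimated by finite differences rather than backpropagation. The sketch only shows why nearest-codebook quantization gives no gradient signal while a two-layer fully connected approximation does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
EMBED_DIM, HIDDEN_DIM, VOCAB_SIZE, TOKEN_DIM = 8, 16, 32, 8

codebook = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))     # stand-in image-tokenizer codebook
token_table = rng.normal(size=(VOCAB_SIZE, TOKEN_DIM))  # stand-in model embedding per token

# Two-layer fully connected shortcut. Weights are random here (assumption);
# a real shortcut would be trained to imitate the tokenizer.
W1 = rng.normal(size=(EMBED_DIM, HIDDEN_DIM)) * 0.3
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(size=(HIDDEN_DIM, VOCAB_SIZE)) * 0.3
b2 = np.zeros(VOCAB_SIZE)


def hard_tokenize(z):
    """Nearest codebook entry -> discrete token id -> embedding (piecewise constant)."""
    token_id = int(np.argmin(np.sum((codebook - z) ** 2, axis=1)))
    return token_table[token_id]


def shortcut(z):
    """MLP -> softmax over the vocabulary -> expected embedding (smooth in z)."""
    hidden = np.tanh(z @ W1 + b1)
    logits = hidden @ W2 + b2
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs @ token_table


def finite_diff_grad(f, z, eps=1e-4):
    """Central finite-difference estimate of the gradient of scalar f at z."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad


target = rng.normal(size=TOKEN_DIM)  # stand-in for whatever the attack objective rewards


def loss_hard(z):
    return float(np.sum((hard_tokenize(z) - target) ** 2))


def loss_soft(z):
    return float(np.sum((shortcut(z) - target) ** 2))


z = codebook[0].copy()  # an image embedding sitting exactly on a codebook entry
grad_hard = finite_diff_grad(loss_hard, z)
grad_soft = finite_diff_grad(loss_soft, z)
print("hard-path gradient norm:", np.linalg.norm(grad_hard))
print("shortcut gradient norm: ", np.linalg.norm(grad_soft))

# A few plain gradient-descent steps along the shortcut path.
z_opt = z.copy()
for _ in range(50):
    z_opt -= 0.01 * finite_diff_grad(loss_soft, z_opt)
print("shortcut loss before/after:", loss_soft(z), loss_soft(z_opt))
```

Small perturbations of `z` never change the selected token on the hard path, so its estimated gradient carries no information, whereas the shortcut path responds smoothly to the input. In an actual attack the gradient would come from backpropagating the model's loss through the shortcut to the image pixels, which this sketch does not attempt.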