Abstract: Augmenting language models with image inputs may enable more effective jailbreak attacks through continuous optimization, unlike text inputs that require discrete optimization. However, new multimodal fusion models tokenize all input modalities using non-differentiable functions, which hinders straightforward attacks. In this work, we introduce the notion of a tokenizer shortcut that approximates tokenization with a continuous function and enables continuous optimization. We use tokenizer shortcuts to create the first end-to-end gradient image attacks against multimodal fusion models. We evaluate our attacks on Chameleon models and obtain jailbreak images that elicit harmful information for 72.5% of prompts. Jailbreak images outperform text jailbreaks optimized with the same objective and require 3x lower compute budget to optimize 50x more input tokens. Finally, we find that representation engineering defenses, like Circuit Breakers, trained only on text attacks can effectively transfer to adversarial image inputs.

What problem does this paper attempt to address?

The main problem this paper addresses is how to enable continuous optimization against multimodal fusion models, so that image inputs can be optimized to bypass safety mechanisms and induce harmful outputs. Specifically, the researchers face the following challenges:

1. **Non-differentiable quantization steps**: Multimodal fusion models map all input modalities (such as text and images) to a shared discrete token space. This introduces non-differentiable quantization steps on the path from input to output, which blocks direct gradient-based adversarial attacks.
2. **Improving attack efficiency and success rate**: Image inputs can be optimized continuously, and therefore more efficiently than text inputs, but only if the non-differentiable quantization steps can be approximated.

To solve these problems, the authors introduce a technique called the "Tokenizer Shortcut", which approximates the non-differentiable steps of image tokenization to create a fully differentiable path and thereby enables end-to-end image optimization. With this method, the authors generate adversarial images that bypass the safety mechanisms of multimodal fusion models and evaluate their effectiveness.

### Main contributions

- **Introducing the Tokenizer Shortcut**: A two-layer fully-connected network approximates the mapping from image embeddings to the token space, which works around the non-differentiable step in multimodal fusion models.
- **First end-to-end image attack**: Using the Tokenizer Shortcut, the authors present the first end-to-end gradient-based image attack on multimodal fusion models.
- **Evaluating attack effectiveness**: Experiments on Chameleon models show that the generated adversarial images elicit harmful information for 72.5% of prompts, while requiring only one-third of the compute budget of text attacks.
- **Exploring the effectiveness of defenses**: The study finds that representation engineering defenses (such as Circuit Breakers) trained only on text attacks transfer effectively to image attacks.

### Research significance

This work shows how to mount efficient adversarial attacks on multimodal fusion models and reveals how existing safety mechanisms behave when faced with this new type of attack, providing a useful reference point for future research.
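To make the non-differentiability problem and the shortcut idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the 2-D codebook, the embedding table, the hand-picked weights of the two-layer network, and the squared-norm loss are all assumptions chosen for illustration (in the paper the shortcut would be trained to match the tokenizer, and the loss would come from the language model). The sketch only shows that a nearest-code lookup yields a zero gradient with respect to the image embedding, while a two-layer fully-connected path yields a usable one.

```python
# Toy illustration only: every size, weight, and name here is an assumption.

CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # hypothetical VQ codes
TOKEN_EMB = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.5]]   # one embedding per code

# Hand-picked weights of a two-layer fully-connected "shortcut" (untrained).
W1 = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]
B1 = [0.1, 0.0, 0.2]
W2 = [[1.0, -0.5, 0.3], [0.2, 0.7, -0.4]]
B2 = [0.0, 0.0]


def hard_path(z):
    """Nearest-code quantization followed by an embedding lookup."""
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in CODEBOOK]
    token_id = dists.index(min(dists))  # argmin: piecewise constant in z
    return TOKEN_EMB[token_id]


def shortcut_path(z):
    """Two-layer network mapping the image embedding directly to an embedding."""
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z)) + b)
              for row, b in zip(W1, B1)]  # linear layer + ReLU
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, B2)]    # second linear layer


def loss(vec):
    """Stand-in objective (squared norm), not the paper's attack loss."""
    return sum(v * v for v in vec)


def numeric_grad(path, z, eps=1e-4):
    """Central finite-difference gradient of loss(path(z)) with respect to z."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(path(up)) - loss(path(down))) / (2 * eps))
    return grad


z = [0.9, 0.2]  # hypothetical image embedding for one patch
hard_grad = numeric_grad(hard_path, z)
soft_grad = numeric_grad(shortcut_path, z)
print("gradient through hard tokenization:", hard_grad)
print("gradient through shortcut network: ", soft_grad)
```

Because small changes to `z` never change which code is nearest, the hard path gives the optimizer no signal, whereas the shortcut path produces non-zero gradients that a gradient-based attack could follow.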

Gradient-based Jailbreak Images for Multimodal Fusion Models

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

White-box Multimodal Jailbreaks Against Large Vision-Language Models

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models

Multimodal Pragmatic Jailbreak on Text-to-image Models

IDEATOR: Jailbreaking VLMs Using VLMs

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Jailbreaking Attack against Multimodal Large Language Model

Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models

Boosting Jailbreak Attack with Momentum

Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models

Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Efficient LLM-Jailbreaking by Introducing Visual Modality