Abstract: With the significant advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly. Previous studies have highlighted VLMs' vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards. However, existing methods struggle against state-of-the-art VLMs like GPT-4o, due to the over-exposure of harmful content and lack of stealthy malicious guidance. In this work, we propose a novel jailbreak attack framework: Multi-Modal Linkage (MML) Attack. Drawing inspiration from cryptography, MML utilizes an encryption-decryption process across text and image modalities to mitigate over-exposure of malicious information. To align the model's output with malicious intent covertly, MML employs a technique called "evil alignment", framing the attack within a video game production scenario. Comprehensive experiments demonstrate MML's effectiveness. Specifically, MML jailbreaks GPT-4o with attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench and 99.07% on HADES-Dataset. Our code is available at https://github.com/wangyu-ovo/MML

What problem does this paper attempt to address?

This paper addresses the vulnerability of large vision-language models (VLMs) to "jailbreak" attacks. Although these models undergo safety alignment training to prevent them from generating harmful content, carefully designed inputs can still bypass the safety mechanisms and lead the model to produce content that violates ethical and legal standards.

### Main problem description in the paper

1. **Limitations of existing methods**:
   - **Excessive exposure of harmful content**: Many existing attack methods embed harmful content directly in the input image, such as bomb pictures or malicious text, which state-of-the-art VLMs easily recognize and reject.
   - **Lack of covert malicious guidance**: Existing text prompts are usually relatively neutral and do not effectively steer the model toward malicious outputs, so its responses are often limited to moral suggestions or legal reminders.
2. **Research objectives**:
   - Propose a new attack framework, the Multi-Modal Linkage (MML) attack, which reduces the direct exposure of harmful information through an encryption-decryption process.
   - Combine it with the "evil alignment" technique, which frames the attack within a video game production scenario so that the model's output aligns more closely with the malicious intent.

### Overview of the solution

To overcome the above problems, the authors propose the following solutions:

1. **Encryption-decryption strategy**:
   - Use multiple encryption methods (such as word replacement, image mirroring, rotation, and Base64 encoding) to encrypt images containing harmful information.
   - In the inference stage, use text prompts to guide the model to decrypt the input and reconstruct the original malicious content.
2. **Evil alignment**:
   - Describe a virtual scenario (such as video game production) to align the model's output with the malicious intent, making the model more likely to generate harmful content.

### Experimental verification

The authors evaluated the MML attack through a series of experiments, testing different VLMs on multiple benchmark datasets (SafeBench, MM-SafeBench, and HADES-Dataset). The results show that the success rate of the MML attack on these datasets is significantly higher than that of existing methods. In particular, against the state-of-the-art GPT-4o, it achieved attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench, and 99.07% on HADES-Dataset.

### Summary

The main contribution of this paper is a new Multi-Modal Linkage attack framework that raises the attack success rate against VLMs through encryption-decryption and evil alignment techniques. The results reveal potential vulnerabilities in existing safety mechanisms and suggest a direction for future research.

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage
