Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Yu Wang, Xiaofei Zhou, Yichen Wang, Geyuan Zhang, Tianxing He
2024-12-07
Abstract: With the significant advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly. Previous studies have highlighted VLMs' vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards. However, existing methods struggle against state-of-the-art VLMs like GPT-4o, due to the over-exposure of harmful content and lack of stealthy malicious guidance. In this work, we propose a novel jailbreak attack framework: Multi-Modal Linkage (MML) Attack. Drawing inspiration from cryptography, MML utilizes an encryption-decryption process across text and image modalities to mitigate over-exposure of malicious information. To align the model's output with malicious intent covertly, MML employs a technique called "evil alignment", framing the attack within a video game production scenario. Comprehensive experiments demonstrate MML's effectiveness. Specifically, MML jailbreaks GPT-4o with attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench and 99.07% on HADES-Dataset. Our code is available at https://github.com/wangyu-ovo/MML
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is the vulnerability of large vision-language models (VLMs) to "jailbreak" attacks. Although these models undergo safety alignment training to prevent the generation of harmful content, carefully designed inputs can still bypass these safety mechanisms, leading the model to generate content that violates ethical and legal standards.

### Main problem description in the paper

1. **Limitations of existing methods**:
   - **Excessive exposure of harmful content**: Many existing attack methods embed harmful content directly in the input image, such as bomb pictures or malicious text, which state-of-the-art VLMs can easily recognize and reject.
   - **Lack of covert malicious guidance**: Existing text prompts are usually relatively neutral and cannot effectively guide the model to generate malicious outputs, so the model's responses are often limited to moral suggestions or legal reminders.
2. **Research objectives**:
   - Propose a new attack framework, the Multi-Modal Linkage (MML) attack, which reduces the direct exposure of harmful information through an encryption-decryption process.
   - Combine it with the "evil alignment" technique, which brings the model's output more in line with malicious intent by framing the attack within a video game production scenario.

### Overview of the solution

To overcome the above problems, the authors propose the following solutions:

1. **Encryption-decryption strategy**:
   - Use multiple encryption methods (such as word replacement, image mirroring, rotation, and Base64 encoding) to encrypt images containing harmful information.
   - In the inference stage, use text prompts to guide the model to decrypt the input and reconstruct the original malicious content.
2. **Evil alignment**:
   - By describing a virtual scenario (such as video game production), align the model's output with malicious intent and make it more likely to generate harmful content.

### Experimental verification

The authors verified the effectiveness of the MML attack through a series of experiments, testing different VLMs on multiple benchmark datasets (SafeBench, MM-SafeBench, and HADES-Dataset). The experimental results show that the success rate of the MML attack on these datasets is significantly higher than that of existing methods. In particular, against the state-of-the-art GPT-4o, it achieved success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench, and 99.07% on HADES-Dataset.

### Summary

The main contribution of this paper is a new Multi-Modal Linkage attack framework, which improves the attack success rate against VLMs through encryption-decryption and evil alignment techniques, reveals potential vulnerabilities in existing safety mechanisms, and provides a new direction for future research.
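
The attack success rates quoted in the experimental results are proportions: the share of evaluated prompts for which the model's response was judged a successful jailbreak. The sketch below shows only that arithmetic. The function name `attack_success_rate` and the toy judgement list are hypothetical, introduced here for illustration; they are not taken from the paper or its code, and how each response is judged is outside the scope of this sketch.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Return the percentage of attempts that were judged successful.

    `judgements` holds one boolean per evaluated prompt (True = judged
    successful). How each judgement is produced is an assumption left
    outside this sketch.
    """
    if not judgements:
        raise ValueError("judgements must not be empty")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(judgements) / len(judgements)


# Hypothetical toy data, not results from the paper: 4 of 5 attempts succeed.
toy_judgements = [True, True, False, True, True]
print(f"{attack_success_rate(toy_judgements):.2f}%")  # prints "80.00%"
```

Reporting the rate as a percentage with two decimal places matches how the figures above are written, for example 97.80%.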