Abstract: This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an *image Jailbreaking Prompt* (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. **Warning: some content generated by language models may be offensive to some readers.**

What problem does this paper attempt to address?

The paper studies jailbreaking attacks on multimodal large language models (MLLMs): attacks that bypass a model's safety mechanisms and induce it to generate harmful or objectionable content. Its main contributions are as follows:

1. **Proposing a new attack method**: The authors propose a maximum likelihood-based algorithm for finding an image jailbreaking prompt (imgJP), which induces an MLLM to produce objectionable responses to harmful queries. The imgJP is not tied to a single input: it remains effective across multiple unseen prompts and images (the data-universal property).
2. **Model transferability**: The generated imgJP transfers to other models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner, that is, without access to the target model's internals.
3. **Connecting MLLM and LLM jailbreaking attacks**: The authors reveal a connection between MLLM-jailbreaks and LLM-jailbreaks and, building on it, introduce a construction-based method that applies their approach to LLM-jailbreaks. They report greater efficiency than current state-of-the-art methods.

Together, these results show that jailbreaking attacks on MLLMs can be mounted systematically and that they generalize across prompts, images, and models. The work exposes security vulnerabilities in existing models and provides a reference point for developing more secure AI systems.

Jailbreaking Attack against Multimodal Large Language Model

JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Efficient LLM-Jailbreaking by Introducing Visual Modality

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Query-Relevant Images Jailbreak Large Multi-Modal Models

White-box Multimodal Jailbreaks Against Large Vision-Language Models

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking

Open Sesame! Universal Black Box Jailbreaking of Large Language Models

Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character

Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves

BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Jailbreaking Proprietary Large Language Models using Word Substitution Cipher

Improved Large Language Model Jailbreak Detection via Pretrained Embeddings

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models