Jailbreaking Attack against Multimodal Large Language Model

Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin
2024-02-04
Abstract: This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an *image Jailbreaking Prompt* (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. **Warning: some content generated by language models may be offensive to some readers.**
Machine Learning, Computation and Language, Cryptography and Security, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper studies jailbreaking attacks on multimodal large language models (MLLMs): attacks that bypass a model's safety mechanisms and induce it to generate harmful or inappropriate content. Its main contributions are as follows:

1. **A new attack method**: The authors propose a maximum likelihood-based algorithm that generates an image jailbreaking prompt (imgJP). An imgJP induces an MLLM to produce objectionable responses to harmful requests. It is not tied to one specific input; it remains effective across multiple unseen prompts and images (the data-universal property).
2. **Model transferability**: An imgJP generated on a surrogate model can still jailbreak other models without knowledge of the target model's architecture, which makes the method suitable for black-box attacks. The models attacked this way include MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2.
3. **Connecting MLLM and LLM jailbreaks**: The authors reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. Building on it, they introduce a construction-based method that converts an image jailbreak into a text jailbreak, and they report greater efficiency than current state-of-the-art LLM-jailbreak methods.

Together, these results show how jailbreaking attacks on MLLMs can be carried out systematically, and the paper verifies their effectiveness and universality on multiple real-world models. The work exposes security vulnerabilities in existing models and provides a reference point for developing more secure AI systems.
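
The maximum likelihood framing in the first contribution can be written compactly. The following is only a sketch: the symbols $x_{jp}$ (the imgJP), $q_i$ (a harmful query), $y_i$ (the response the attacker wants the model to give), $M$ (the number of query-response pairs), and $p_\theta$ (the model's conditional likelihood under its parameters $\theta$) are notation assumed here for illustration, and the paper's exact formulation may differ.

```latex
% Assumed notation, not taken from the paper:
%   x_{jp}   : the image jailbreaking prompt being sought
%   (q_i,y_i): harmful query and the response the attacker targets
%   p_\theta : likelihood the MLLM assigns to y_i given the query and image
\[
  x_{jp}^{\ast} \;=\; \arg\max_{x_{jp}} \; \sum_{i=1}^{M} \log p_\theta\!\left( y_i \mid q_i,\, x_{jp} \right)
\]
```

Under this reading, summing over many query-response pairs rather than a single one is what would make the resulting imgJP prompt-universal: one image is asked to raise the likelihood of the targeted responses for all $M$ queries at once.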