$\textit{MMJ-Bench}$: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

Fenghua Weng, Yue Xu, Chengyan Fu, Wenjie Wang
2024-10-22
Abstract: As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Multimodal Large Language Models (MLLMs), have shown exceptional performance in many real-world tasks. However, MLLMs face significant security challenges, such as jailbreak attacks, where attackers attempt to bypass the model's safety alignment to elicit harmful responses. The threat of jailbreak attacks on MLLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that MLLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations, as each method is evaluated on different datasets and metrics, making it impossible to compare the effectiveness of each method. To address this gap, we introduce \textit{MMJ-Bench}, a unified pipeline for evaluating jailbreak attacks and defense techniques for MLLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA MLLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contributes to the field by offering a unified and systematic evaluation framework and the first publicly available benchmark for MLLM jailbreak research. We also demonstrate several insightful findings that highlight directions for future studies.
Cryptography and Security
What problem does this paper attempt to address?
This paper attempts to address the security and reliability issues of multimodal large language models (MLLMs) when facing "jailbreak attacks". Specifically, the paper aims to:

1. **Evaluate the effectiveness of existing attack and defense techniques**: By introducing a unified evaluation framework, MMJ-Bench, systematically evaluate the performance of existing jailbreak attacks and defense techniques in multimodal large language models.
2. **Fill the gap in evaluation criteria**: Currently, various attack and defense methods use different datasets, target models, and evaluation metrics, making it difficult to conduct a comprehensive comparison. The paper proposes a standardized evaluation process to ensure the comparability between different methods.
3. **Provide a publicly available benchmark**: Provide the research community with the first publicly available benchmark platform for multimodal large language model jailbreak attack and defense techniques, promoting further research and development.

### Main contributions

- **Proposed the MMJ-Bench framework**: Constructed a systematic unified pipeline for comprehensively evaluating existing jailbreak attack and defense techniques.
- **Extensive experimental results**: Not only systematically compared attack and defense methods, but also pointed out the directions for future research.
- **Released the first public benchmark**: Including a comprehensive collection of attack and defense techniques for further research.

### Background and related work

- **Jailbreak attack threat model**: Introduced how jailbreak attacks in MLLMs are different from those in LLMs, especially the additional security threats introduced by new modalities.
- **Jailbreak attack classification**: Divided into generation-based attacks (such as embedding malicious content into images) and optimization-based attacks (such as adversarial perturbations).
- **Defense technique classification**: Divided into active defense (such as safety fine-tuning) and passive defense (such as attack detection).

### Experimental design

The paper evaluates attack and defense techniques through a four-step workflow:

1. **Data collection**: Prepare harmful queries and clean images.
2. **Jailbreak case generation**: Generate attack instances according to the selected method.
3. **Response generation**: Record the model's response to the attack instances.
4. **Evaluation**: Use GPT-4 and the HarmBench classifier to measure the attack success rate (ASR) and the effectiveness of defense techniques.

### Experimental results

- **Differences in attack effects**: Different attack methods have different effects on different MLLMs.
- **Impact of evaluator selection**: Different evaluators (such as GPT-4 and the HarmBench classifier) yield different ASR results.
- **Inconsistent model robustness**: No MLLM shows consistent robustness against all jailbreak attacks.
- **Importance of distinguishing between safety and utility**: A low ASR does not necessarily mean stronger safety protection; it may instead reflect the model's deficiencies in visual understanding and cross-modal alignment.

In conclusion, this paper fills the gap in the field of multimodal large language model jailbreak attack and defense evaluation by introducing the MMJ-Bench framework and provides important references and tools for future research.
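To make the evaluation step of the workflow concrete, here is a minimal sketch of how an attack success rate could be computed from per-response judge verdicts and broken down by attack, model, and evaluator. It assumes ASR is the fraction of jailbreak attempts whose responses a judge labels harmful. The `Verdict` record, the `attack_success_rate` function, and all labels in the toy data are hypothetical names chosen for illustration; they are not taken from the MMJ-Bench code, and the toy verdicts are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Verdict(NamedTuple):
    """One judged response (hypothetical record layout, for illustration)."""
    attack: str      # name of the jailbreak method
    model: str       # name of the target MLLM
    evaluator: str   # name of the judge that produced the label
    harmful: bool    # True if the judge labeled the response harmful


def attack_success_rate(
    verdicts: Iterable[Verdict],
) -> Dict[Tuple[str, str, str], float]:
    """Return the ASR for every (attack, model, evaluator) combination.

    ASR is taken here to be (# responses judged harmful) / (# attempts).
    """
    totals: Dict[Tuple[str, str, str], int] = defaultdict(int)
    successes: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for v in verdicts:
        key = (v.attack, v.model, v.evaluator)
        totals[key] += 1
        successes[key] += int(v.harmful)
    return {key: successes[key] / totals[key] for key in totals}


# Toy verdicts with placeholder labels: the same two responses are scored
# by two different judges, which may disagree on an individual response.
toy_verdicts = [
    Verdict("attack_A", "model_X", "judge_1", True),
    Verdict("attack_A", "model_X", "judge_1", False),
    Verdict("attack_A", "model_X", "judge_2", False),
    Verdict("attack_A", "model_X", "judge_2", False),
]

for key, asr in sorted(attack_success_rate(toy_verdicts).items()):
    print(key, f"ASR={asr:.2f}")
```

Keeping the evaluator in the grouping key means the same set of responses can be scored by several judges side by side, which is one simple way to surface the evaluator-dependent ASR differences noted in the results.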