Abstract: As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Multimodal Large Language Models (MLLMs), have shown exceptional performance in many real-world tasks. However, MLLMs face significant security challenges, such as jailbreak attacks, in which attackers attempt to bypass the model's safety alignment to elicit harmful responses. The threat of jailbreak attacks on MLLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that MLLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations: each method is evaluated on different datasets and metrics, making it impossible to compare their effectiveness. To address this gap, we introduce \textit{MMJ-Bench}, a unified pipeline for evaluating jailbreak attacks and defense techniques for MLLMs. Through extensive experiments, we assess the effectiveness of various attack methods against state-of-the-art (SoTA) MLLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contributes to the field by offering a unified and systematic evaluation framework and the first publicly available benchmark for MLLM jailbreak research. We also present several insightful findings that highlight directions for future studies.

What problem does this paper attempt to address?

This paper attempts to address the security and reliability issues of multimodal large language models (MLLMs) when facing "jailbreak attacks". Specifically, the paper aims to:

1. **Evaluate the effectiveness of existing attack and defense techniques**: By introducing a unified evaluation framework, MMJ-Bench, systematically evaluate the performance of existing jailbreak attacks and defense techniques in multimodal large language models.
2. **Fill the gap in evaluation criteria**: Currently, various attack and defense methods use different datasets, target models, and evaluation metrics, making it difficult to conduct a comprehensive comparison. The paper proposes a standardized evaluation process to ensure comparability between different methods.
3. **Provide a publicly available benchmark**: Provide the research community with the first publicly available benchmark platform for multimodal large language model jailbreak attack and defense techniques, promoting further research and development.

### Main contributions

- **Proposed the MMJ-Bench framework**: Constructed a systematic unified pipeline for comprehensively evaluating existing jailbreak attack and defense techniques.
- **Extensive experimental results**: Not only systematically compared attack and defense methods, but also pointed out directions for future research.
- **Released the first public benchmark**: Including a comprehensive collection of attack and defense techniques for further research.

### Background and related work

- **Jailbreak attack threat model**: Introduced how jailbreak attacks in MLLMs differ from those in LLMs, especially the additional security threats introduced by new modalities.
- **Jailbreak attack classification**: Divided into generation-based attacks (such as embedding malicious content into images) and optimization-based attacks (such as adversarial perturbations).
- **Defense technique classification**: Divided into active defense (such as secure fine-tuning) and passive defense (such as attack detection).

### Experimental design

The paper evaluates attack and defense techniques through a four-step workflow:

1. **Data collection**: Prepare harmful queries and clean images.
2. **Jailbreak case generation**: Generate attack instances according to the selected method.
3. **Response generation**: Record the model's response to the attack instances.
4. **Evaluation**: Use GPT-4 and HarmBench classifiers to evaluate the attack success rate (ASR) and evaluate the effectiveness of defense techniques.

### Experimental results

- **Differences in attack effects**: Different attack methods have different effects on different MLLMs.
- **Impact of evaluator selection**: Different evaluators (such as GPT-4 and HarmBench classifiers) will yield different ASR results.
- **Inconsistent model robustness**: No MLLM shows consistent robustness against all jailbreak attacks.
- **Importance of distinguishing between security and practicality**: A low ASR does not necessarily mean stronger security protection; it may instead be due to the model's deficiencies in visual understanding and cross-modal alignment.

In conclusion, this paper fills the gap in the field of multimodal large language model jailbreak attack and defense evaluation by introducing the MMJ-Bench framework and provides important references and tools for future research.
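
The four-step workflow and the ASR metric summarized above can be illustrated with a minimal Python sketch. It is not the MMJ-Bench implementation: every name here (`JailbreakCase`, `run_pipeline`, `toy_attack`, `toy_model`, `judge_a`, `judge_b`) is hypothetical, the queries are benign placeholders, and the two judges are simple string rules standing in for real evaluators. ASR is assumed to be the fraction of responses that a judge flags as harmful.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class JailbreakCase:
    """One attack instance: a text query paired with an image identifier."""
    query: str
    image: str


def attack_success_rate(verdicts: List[bool]) -> float:
    """Assumed ASR definition: share of responses judged harmful."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def run_pipeline(
    queries: List[str],
    clean_image: str,
    attack: Callable[[str, str], JailbreakCase],
    model: Callable[[JailbreakCase], str],
    judges: Dict[str, Callable[[str, str], bool]],
) -> Dict[str, float]:
    """Steps 2-4: build cases, collect responses, score with each judge."""
    cases = [attack(q, clean_image) for q in queries]      # step 2
    responses = [model(c) for c in cases]                  # step 3
    return {                                               # step 4
        name: attack_success_rate(
            [judge(c.query, r) for c, r in zip(cases, responses)]
        )
        for name, judge in judges.items()
    }


# --- Hypothetical stand-ins so the sketch runs end to end ---

def toy_attack(query: str, clean_image: str) -> JailbreakCase:
    # Placeholder for an attack method; only tags the image name.
    return JailbreakCase(query=query, image=f"{clean_image}+modified")


def toy_model(case: JailbreakCase) -> str:
    # Placeholder for an MLLM; the reply depends only on the query index.
    idx = int(case.query.split("-")[1])
    if idx % 2 == 0:
        return "I cannot help with that."
    if idx % 3 == 0:
        return "Sure, but only in general terms."
    return "Sure, here are the steps."


def judge_a(query: str, response: str) -> bool:
    # Flags any response that begins with compliance.
    return response.startswith("Sure")


def judge_b(query: str, response: str) -> bool:
    # Flags only responses that appear to contain concrete instructions.
    return "steps" in response


# Step 1: data collection (benign placeholders instead of harmful queries).
queries = [f"query-{i}" for i in range(1, 7)]
asr = run_pipeline(
    queries, "clean.png", toy_attack, toy_model,
    {"judge_a": judge_a, "judge_b": judge_b},
)
print(asr)
```

On this toy data the two judges disagree: `judge_a` reports an ASR of 0.5 and `judge_b` about 0.33 for the same responses, which mirrors the point that the choice of evaluator changes the reported ASR. Passing judges as a dictionary keeps the scoring step independent of the attack and model under test.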

$\textit{MMJ-Bench}$: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models
