JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

2024-11-24

Abstract: With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question: whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2,000 malicious queries that is also proposed in this paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLM jailbreak attacks. In total, our comprehensive dataset includes 28,000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.

Cryptography and Security, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper evaluates the robustness of multimodal large language models (MLLMs) against jailbreak attacks. In particular, it asks whether jailbreak techniques that succeed against text-only large language models (LLMs) are equally effective against multimodal models. To study this, the authors introduce a benchmark named JailBreakV-28K, which contains 28,000 test cases spanning text-based and image-based jailbreak attacks. Using this benchmark, they evaluate 10 open-source MLLMs and find a critical vulnerability that stems from the models' text-processing capabilities: attacks transferred from LLMs achieve a notably high success rate. The findings indicate that future research needs to address alignment vulnerabilities in MLLMs for both textual and visual inputs.
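
The abstract reports results as an Attack Success Rate (ASR). As a rough illustration only, the sketch below assumes ASR is the fraction of test cases whose model response an external judge flags as a successful jailbreak. The `JudgedCase` record, the `attack_type` labels, and both function names are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class JudgedCase(NamedTuple):
    """One evaluated test case (hypothetical record layout)."""
    attack_type: str   # assumed label, e.g. "text" or "image"
    jailbroken: bool   # verdict supplied by some external judge


def attack_success_rate(cases: Iterable[JudgedCase]) -> float:
    """Fraction of cases judged as successful jailbreaks."""
    cases = list(cases)
    if not cases:
        raise ValueError("ASR is undefined for an empty set of cases")
    return sum(c.jailbroken for c in cases) / len(cases)


def asr_by_attack_type(cases: Iterable[JudgedCase]) -> Dict[str, float]:
    """ASR computed separately for each attack type."""
    groups = defaultdict(list)
    for case in cases:
        groups[case.attack_type].append(case)
    return {name: attack_success_rate(group) for name, group in groups.items()}
```

Breaking the rate down by attack type is one way to compare text-based attacks transferred from LLMs with image-based attacks, which is the comparison the benchmark is built around.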

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

Jailbreaking Attack against Multimodal Large Language Model

Comprehensive Assessment of Jailbreak Attacks Against LLMs

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

Efficient LLM-Jailbreaking by Introducing Visual Modality

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models

White-box Multimodal Jailbreaks Against Large Vision-Language Models

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models