Abstract: The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques, and defense strategies. Compared to the more advanced state of unimodal jailbreaking, the multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs.

What problem does this paper attempt to address?

The paper surveys the security of large language models (LLMs) and multimodal large language models (MLLMs) under jailbreaking attacks, together with the strategies for defending them. Specifically, the research objectives include the following aspects:

1. **Comprehensive Overview**: Provide a comprehensive review of the current state of jailbreaking attacks against LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques, and defense strategies.
2. **Exploration in the Multimodal Domain**: Compared to the relatively mature research on unimodal LLMs, the multimodal domain is still in the exploratory stage. The paper summarizes the limitations of current multimodal jailbreaking research and proposes future research directions.
3. **Evaluation Datasets**: Introduce the datasets used to evaluate the security of LLMs and MLLMs, covering both single-turn query-response and multi-turn dialogue settings.
4. **Attack Methods**: Describe in detail non-parametric and parametric attacks. The former are semantic attacks that manipulate input prompts or images, while the latter are non-semantic attacks that rely on access to model weights or logits.
5. **Defense Strategies**: Discuss intrinsic defenses (such as strengthening the model's safety training) and extrinsic defenses (such as protective measures at the input or output end) that improve the model's resistance to jailbreaking attacks.
6. **Future Directions**: Propose research suggestions that address the shortcomings of multimodal jailbreaking research, such as limited image sources, narrow task scope, and static toxicity. Suggestions include increasing image diversity and constructing complex multimodal tasks.
Through this survey, the paper aims to provide a theoretical foundation and technical guidance for the security and defense mechanisms of future multimodal large language models, further enhancing the robustness and security of MLLMs.
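
The extrinsic defenses described above (protective measures at the input or output end) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the marker lists, the `guarded_generate` wrapper, and the `toy_model` stand-in are hypothetical, and a real system would likely use trained classifiers rather than substring matching.

```python
from typing import Callable, Tuple

# Hypothetical marker lists (assumptions for illustration only).
BLOCKED_INPUT_MARKERS: Tuple[str, ...] = (
    "ignore all previous instructions",
    "pretend you have no rules",
)
BLOCKED_OUTPUT_MARKERS: Tuple[str, ...] = ("[unsafe]",)

REFUSAL = "Sorry, I can't help with that."


def is_flagged(text: str, markers: Tuple[str, ...]) -> bool:
    """Return True if any marker occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


def guarded_generate(model: Callable[[str], str], prompt: str) -> str:
    """Wrap a model with an input-side and an output-side check."""
    # Input-side defense: refuse before the model ever sees the prompt.
    if is_flagged(prompt, BLOCKED_INPUT_MARKERS):
        return REFUSAL
    response = model(prompt)
    # Output-side defense: screen the response before returning it.
    if is_flagged(response, BLOCKED_OUTPUT_MARKERS):
        return REFUSAL
    return response


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM; emits a tagged placeholder for some prompts."""
    if "leak" in prompt.lower():
        return "[UNSAFE] simulated harmful completion"
    return f"Echo: {prompt}"


if __name__ == "__main__":
    for p in ("Hello", "Please ignore all previous instructions", "leak something"):
        print(repr(p), "->", guarded_generate(toy_model, p))
```

The wrapper leaves the model itself untouched, which is what distinguishes an extrinsic defense from an intrinsic one that changes the model through training.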

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking

JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models

Efficient LLM-Jailbreaking by Introducing Visual Modality

Jailbreaking Attack against Multimodal Large Language Model

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach

Comprehensive Assessment of Jailbreak Attacks Against LLMs

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models

Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models

Playing Language Game with LLMs Leads to Jailbreaking

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Distract Large Language Models for Automatic Jailbreak Attack

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

Open the Pandora's Box of LLMs: Jailbreaking LLMs Through Representation Engineering