From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking

Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei
2024-06-21
Abstract: The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of unimodal jailbreaking, the multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper examines how large language models (LLMs) and multimodal large language models (MLLMs) can be jailbroken and how they can be defended against such attacks. Specifically, its objectives include the following:
1. **Comprehensive Overview**: Review the current state of jailbreaking research on LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques, and defense strategies.
2. **Exploration in the Multimodal Domain**: Compared with the relatively mature research on unimodal LLMs, the multimodal domain is still at an exploratory stage. The paper summarizes the limitations of current multimodal jailbreaking research and proposes future research directions.
3. **Evaluation Datasets**: Introduce the datasets used to evaluate the safety of LLMs and MLLMs, covering both single-turn query-response and multi-turn dialogue settings.
4. **Attack Methods**: Describe non-parametric and parametric attacks in detail. Non-parametric attacks are semantic attacks that manipulate the input prompts or images, while parametric attacks are non-semantic attacks that rely on access to model weights or logits.
5. **Defense Strategies**: Discuss intrinsic defenses (such as strengthening the model's safety training) and extrinsic defenses (such as adding protective measures at the input or output end) that improve the model's resistance to jailbreaking attacks.
6. **Future Directions**: Propose ways to address the shortcomings of multimodal jailbreaking research, namely limited image sources, narrow task scope, and static toxicity. Suggestions include increasing image diversity and constructing complex multimodal tasks.
Through this survey, the paper aims to provide a theoretical foundation and technical guidance for the security and defense mechanisms of future multimodal large language models, further enhancing the robustness and security of MLLMs.
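
As an illustration of the extrinsic defenses mentioned under Defense Strategies (protective measures at the input or output end), the following is a minimal Python sketch, not a method taken from the paper. The names `make_keyword_filter`, `guarded_generate`, and `REFUSAL` are hypothetical, the model is assumed to be any callable that maps a prompt string to a response string, and the keyword check is a toy stand-in for whatever classifier a real system might use.

```python
from typing import Callable, Iterable

# Hypothetical refusal message returned when a filter fires.
REFUSAL = "Sorry, I can't help with that request."


def make_keyword_filter(blocked_terms: Iterable[str]) -> Callable[[str], bool]:
    """Build a toy filter that flags text containing any blocked term.

    This is an assumption for illustration; a real extrinsic defense
    would more likely use a trained safety classifier.
    """
    terms = [term.lower() for term in blocked_terms]

    def is_flagged(text: str) -> bool:
        lowered = text.lower()
        return any(term in lowered for term in terms)

    return is_flagged


def guarded_generate(
    model: Callable[[str], str],
    prompt: str,
    input_filter: Callable[[str], bool],
    output_filter: Callable[[str], bool],
) -> str:
    """Wrap a model call with checks at the input end and the output end."""
    # Input-side defense: refuse before the model ever sees the prompt.
    if input_filter(prompt):
        return REFUSAL
    response = model(prompt)
    # Output-side defense: refuse if the generated text is flagged.
    if output_filter(response):
        return REFUSAL
    return response
```

The wrapper leaves the model itself untouched, which is what distinguishes an extrinsic defense from an intrinsic one: the filters can be swapped or updated without retraining.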