Abstract: With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks: carefully crafted image-prompt pairs that compel the model to generate harmful content. In this work, we first highlight a critical safety gap, demonstrating that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model during decoding to defend against jailbreak attacks. Additionally, we provide a rigorous mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model's original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.

What problem does this paper attempt to address?

This paper addresses the safety of multimodal large language models (MLLMs) on visual-reasoning tasks, in particular their vulnerability to jailbreak attacks. Specifically:

1. **Background and Motivation**:
   - MLLMs have made remarkable progress on tasks such as visual question answering and image captioning.
   - Ensuring that their outputs are safe and free of discrimination, misinformation, and harmful content is crucial for their wide adoption in real-world applications.
2. **Existing Problems**:
   - Although safety alignment through reinforcement learning from human feedback (RLHF) has shown some effect, recent research indicates that MLLMs remain vulnerable to jailbreak attacks.
   - A jailbreak attack bypasses safety mechanisms with carefully designed image-text prompt pairs, forcing the model to generate harmful content.
3. **Key Challenges**:
   - **Insufficient Training-Time Safety Alignment**: Safety alignment methods based on fine-tuning rely on a static prompt distribution, which makes them inherently vulnerable to jailbreak attacks. An attacker can generate adversarial prompts that bypass the safety mechanisms by solving a prompt optimization problem.
   - **Resource Intensity**: Existing fine-tuning methods require large annotated datasets and substantial compute to retrain models with billions of parameters.
4. **Research Objectives**:
   - Propose a new framework, called **Immune**, that provides defense at inference time instead of relying solely on training-time safety alignment.
   - Use a safety reward model to adjust the MLLM's behavior during decoding, resisting jailbreak attacks while preserving the model's original capabilities.
5. **Contributions**:
   - **Theoretical Proof**: Demonstrate that MLLMs can still be jailbroken even after training-time safety alignment, and propose a mathematical formalization of the reverse alignment problem.
   - **Empirical Evaluation**: Verify the effectiveness of Immune through extensive experiments, which show that it significantly reduces the attack success rate while preserving the model's utility.

In summary, this paper tackles the safety of MLLMs under jailbreak attacks by proposing Immune, a new inference-time defense framework intended to make the model safer and more reliable when generating content.
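The idea of steering decoding with a safety reward model can be illustrated with a small, self-contained sketch. This is not the paper's implementation: the scoring rule (base log-probability plus reward divided by a temperature `alpha`), the greedy token choice, the top-k candidate filter, and the toy model and reward functions are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List

# Hypothetical toy "base model": next-token probabilities keyed by the last token.
TOY_TABLE: Dict[object, Dict[str, float]] = {
    None: {"Sure,": 0.6, "Sorry,": 0.3, "Maybe": 0.1},
    "Sure,": {"here": 0.9, "<eos>": 0.1},
    "Sorry,": {"no.": 0.8, "<eos>": 0.2},
    "Maybe": {"<eos>": 1.0},
    "here": {"<eos>": 1.0},
    "no.": {"<eos>": 1.0},
}


def toy_next_log_probs(prefix: List[str]) -> Dict[str, float]:
    """Return log-probabilities for the next token given the prefix."""
    last = prefix[-1] if prefix else None
    return {tok: math.log(p) for tok, p in TOY_TABLE[last].items()}


def toy_safety_reward(tokens: List[str]) -> float:
    """Hypothetical safety reward: penalize compliance, reward refusal."""
    if "Sure," in tokens:
        return -1.0
    if "Sorry," in tokens:
        return 1.0
    return 0.0


def guided_decode(
    next_log_probs: Callable[[List[str]], Dict[str, float]],
    safety_reward: Callable[[List[str]], float],
    alpha: float = 0.5,
    top_k: int = 3,
    max_tokens: int = 4,
    eos: str = "<eos>",
) -> List[str]:
    """Greedy reward-guided decoding (assumed form, not the paper's exact rule).

    At each step, keep the top_k tokens under the base model and pick the one
    maximizing log p(token | prefix) + safety_reward(prefix + token) / alpha.
    A smaller alpha gives the safety reward more weight.
    """
    prefix: List[str] = []
    for _ in range(max_tokens):
        log_probs = next_log_probs(prefix)
        candidates = sorted(log_probs, key=log_probs.get, reverse=True)[:top_k]
        best = max(
            candidates,
            key=lambda tok: log_probs[tok] + safety_reward(prefix + [tok]) / alpha,
        )
        if best == eos:
            break
        prefix.append(best)
    return prefix


if __name__ == "__main__":
    # A very large alpha makes the reward term negligible (plain greedy decoding).
    print(guided_decode(toy_next_log_probs, toy_safety_reward, alpha=1e9))
    # A small alpha lets the safety reward override the base model's preference.
    print(guided_decode(toy_next_log_probs, toy_safety_reward, alpha=0.5))
```

In this toy setup the base model prefers the compliant continuation, and the reward term shifts the choice toward the refusal without retraining the model. In a real system the reward would be computed by a learned model over the image-prompt pair and the partial response.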

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
