Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Ahmad Beirami, Furong Huang, Alvaro Velasquez, Dinesh Manocha, Amrit Singh Bedi
2024-11-28
Abstract: With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks: carefully crafted image-prompt pairs that compel the model to generate harmful content. In this work, we first highlight a critical safety gap, demonstrating that alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model during decoding to defend against jailbreak attacks. Additionally, we provide a rigorous mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model's original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and state-of-the-art defense strategy, respectively.
Cryptography and Security, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the safety of multimodal large language models (MLLMs) on visual-reasoning tasks, in particular their vulnerability to jailbreak attacks. Specifically:

1. **Background and Motivation**:
   - MLLMs have made remarkable progress on tasks such as visual question answering and image captioning.
   - Ensuring that their outputs are safe and free of discrimination, misinformation, and harmful content is crucial for their wide adoption in real-world applications.
2. **Existing Problems**:
   - Although safety alignment through reinforcement learning from human feedback (RLHF) has had some effect, recent research indicates that MLLMs remain vulnerable to jailbreak attacks.
   - A jailbreak attack bypasses safety mechanisms with carefully crafted image-text prompt pairs that force the model to generate harmful content.
3. **Key Challenges**:
   - **Insufficient Training-Time Safety Alignment**: Safety alignment based on fine-tuning relies on a static prompt distribution, which leaves it inherently vulnerable to jailbreak attacks. An attacker can solve a prompt-optimization problem to generate adversarial prompts that bypass the safety mechanisms.
   - **Resource-Intensive**: Existing fine-tuning methods require large annotated datasets and substantial compute to retrain models with billions of parameters.
4. **Research Objectives**:
   - Propose a new framework, **Immune**, that defends at inference time instead of relying solely on training-time safety alignment.
   - Use a safety reward model to steer the MLLM during decoding, so that it resists jailbreak attacks while retaining its original capabilities.
5. **Contributions**:
   - **Theoretical Proof**: Demonstrate that MLLMs can still be jailbroken even after training-time safety alignment, and propose a mathematical formalization of the reverse alignment problem.
   - **Empirical Evaluation**: Verify the effectiveness of Immune through extensive experiments, which show that it significantly reduces the attack success rate while preserving the model's utility.

In summary, this paper tackles the safety of MLLMs under jailbreak attacks by proposing Immune, an inference-time defense framework intended to make the model's generated content safer and more reliable.
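To make the idea of steering decoding with a safety reward model concrete, here is a minimal sketch of one reward-guided decoding step. It is an illustration under stated assumptions, not the paper's algorithm: the scoring rule (base log-probability plus reward divided by a trade-off parameter `alpha`), the top-`k` candidate set, and every name in it (`reward_guided_step`, `toy_safety_reward`, `toy_logprobs`) are hypothetical, and the toy reward stands in for a learned safety reward model.

```python
import math


def top_k(logprobs, k):
    """Return the k tokens with the highest base-model log-probability."""
    return sorted(logprobs, key=logprobs.get, reverse=True)[:k]


def reward_guided_step(logprobs, safety_reward, prefix, k=3, alpha=1.0):
    """Choose the next token from the top-k candidates using a combined score.

    logprobs: dict mapping token -> log-probability under the base model
        (a hypothetical toy input, not a real model output).
    safety_reward: callable(prefix, token) -> float, higher means safer
        (hypothetical stand-in for a learned reward model).
    alpha: trade-off parameter; a smaller alpha weights the reward more.
    """
    candidates = top_k(logprobs, k)
    scores = {
        token: logprobs[token] + safety_reward(prefix, token) / alpha
        for token in candidates
    }
    return max(scores, key=scores.get)


def toy_safety_reward(prefix, token):
    # Toy rule: heavily penalize one blocklisted token, stay neutral otherwise.
    return -5.0 if token == "explosive" else 0.0


# Toy next-token distribution in which the unsafe token is the most likely.
toy_logprobs = {
    "explosive": math.log(0.5),
    "sorry": math.log(0.3),
    "cake": math.log(0.15),
    "the": math.log(0.05),
}

prefix = ["how", "to", "make"]
greedy_token = max(toy_logprobs, key=toy_logprobs.get)
guided_token = reward_guided_step(toy_logprobs, toy_safety_reward, prefix)
print(greedy_token, guided_token)
```

In this toy setup, plain greedy decoding picks the unsafe token because it has the highest base probability, while the reward-guided step picks a safer candidate. Raising `alpha` shrinks the reward's influence, and the choice moves back toward the base model's preference. The base model's weights are never touched, which is the practical appeal of an inference-time defense over retraining.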