Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang

2024-10-15

Abstract: Multimodal large language models (MLLMs) have shown impressive reasoning abilities. However, they are also more vulnerable to jailbreak attacks than their LLM predecessors. We observe that, although MLLMs remain capable of detecting unsafe responses, the safety mechanisms of their pre-aligned LLMs can be easily bypassed once image features are introduced. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into text, which activates the intrinsic safety mechanism of the pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe with LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised fine-tuning (SFT) data for MLLM alignment without extra human intervention.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the tendency of multimodal large language models (MLLMs) to generate unsafe content once image features are incorporated. Although these models inherit the reasoning capabilities of large language models (LLMs), image inputs make their safety mechanisms fragile and leave them susceptible to jailbreak attacks. The paper proposes ECSO (Eyes Closed, Safety On), a training-free method that improves safety by converting unsafe images into text, thereby reactivating the pre-aligned safety mechanisms of the LLMs within MLLMs while maintaining utility. Specifically, ECSO first evaluates whether the response generated by the MLLM is safe. If the initial response is judged unsafe, the image input is converted into text, which reduces the MLLM to a text-only LLM that then generates a safer response. Experimental results show that ECSO significantly improves the safety of five state-of-the-art MLLMs without sacrificing their performance on common MLLM benchmarks. ECSO can also serve as a data engine that generates supervised fine-tuning (SFT) data for aligning MLLMs without additional human intervention.
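The control flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `generate(prompt, image)` callable, the `EcsoResult` type, the prompt wording, the yes/no parsing, and the choice to show the image during the safety check are all assumptions introduced here. Passing `image=None` stands for running the model as a text-only LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model interface: generate(prompt, image) -> response text.
# image=None means the model is queried with text only.
Generate = Callable[[str, Optional[object]], str]


@dataclass
class EcsoResult:
    response: str
    used_fallback: bool  # True if the image was replaced by a text description


def ecso_respond(generate: Generate, query: str, image: object) -> EcsoResult:
    # Step 1: ordinary multimodal answer.
    first = generate(query, image)

    # Step 2: ask the same model to judge its own answer.
    # The prompt wording and "yes"/"no" parsing are assumptions of this sketch.
    verdict = generate(
        "Is the following response harmful or unsafe? Answer yes or no.\n"
        f"Response: {first}",
        image,
    )
    if not verdict.strip().lower().startswith("yes"):
        return EcsoResult(first, used_fallback=False)

    # Step 3: convert the image into text, conditioned on the query.
    caption = generate(
        f"Describe the image content relevant to this request: {query}",
        image,
    )

    # Step 4: answer again without the image ("eyes closed").
    safer = generate(f"Image description: {caption}\nRequest: {query}", None)
    return EcsoResult(safer, used_fallback=True)
```

Because the fallback path runs only when the first response is flagged, queries judged safe keep their original multimodal answer, which is consistent with the summary's claim that utility is maintained.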

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance

Safety of Multimodal Large Language Models on Images and Texts

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Safety Alignment for Vision Language Models

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

VLSBench: Unveiling Visual Leakage in Multimodal Safety

SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models

Multimodal Situational Safety

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

Query-Relevant Images Jailbreak Large Multi-Modal Models

Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

MLLM-as-a-Judge for Image Safety without Human Labeling

Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models