Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang
2024-10-15
Abstract: Multimodal large language models (MLLMs) have shown impressive reasoning abilities. However, they are also more vulnerable to jailbreak attacks than their LLM predecessors. We observe that, although MLLMs are still capable of detecting unsafe responses, the safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed once image features are introduced. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into text, thereby activating the intrinsic safety mechanism of the pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe with LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised-finetuning (SFT) data for MLLM alignment without extra human intervention.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the tendency of multimodal large language models (MLLMs) to generate unsafe content once image features are incorporated. Although these models inherit the reasoning capabilities of large language models (LLMs), image inputs weaken the safety mechanisms of the underlying pre-aligned LLMs, leaving MLLMs more susceptible to jailbreak attacks. The paper proposes a training-free method called ECSO (Eyes Closed, Safety On), which improves safety by converting unsafe images into text, thereby reactivating the pre-aligned safety mechanisms of the LLMs within MLLMs while maintaining utility. Specifically, ECSO first has the MLLM evaluate whether its own initial response is safe. If the response is judged unsafe, the image input is converted into text, which reduces the MLLM to a text-only LLM that then generates a safer response. Experimental results show that ECSO significantly improves the safety of five state-of-the-art MLLMs without sacrificing their performance on common MLLM benchmarks. ECSO can also be used as a data engine to generate supervised fine-tuning (SFT) data for aligning MLLMs, without additional human intervention.
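The control flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `MLLM` callable interface, the function name `ecso_respond`, the prompt wording, and the choice to run the safety check without the image are all assumptions made for illustration, and `toy_mllm` is a hypothetical stand-in for a real model.

```python
from typing import Any, Callable, Optional

# Hypothetical model interface (an assumption, not an API from the paper):
# (prompt, image or None) -> generated text. image=None means text-only mode.
MLLM = Callable[[str, Optional[Any]], str]


def ecso_respond(mllm: MLLM, query: str, image: Any) -> str:
    """Answer `query` about `image`, falling back to text-only generation."""
    # Step 1: ordinary multimodal response.
    first = mllm(query, image)

    # Step 2: ask the same model to judge its own response.
    # The prompt wording here is illustrative only.
    verdict = mllm(
        "Is the following response harmful or unsafe? Answer yes or no.\n" + first,
        None,
    )
    if not verdict.strip().lower().startswith("yes"):
        return first  # judged safe: keep the original multimodal answer

    # Step 3 ("eyes closed"): convert the image into text, conditioned on the query.
    caption = mllm("Describe the image content relevant to this request: " + query, image)

    # Step 4: regenerate without the image, so the model acts as a text-only LLM.
    return mllm("Image description: " + caption + "\nRequest: " + query, None)


def toy_mllm(prompt: str, image: Optional[Any]) -> str:
    """Hypothetical stub that mimics a model whose safety lapses only with images."""
    if prompt.startswith("Is the following response"):
        return "yes" if "UNSAFE" in prompt else "no"
    if prompt.startswith("Describe the image"):
        return "a caption"
    if image is None:
        return "I cannot help with that."
    return "UNSAFE details" if image == "bad.png" else "A helpful answer."


if __name__ == "__main__":
    print(ecso_respond(toy_mllm, "How do I do this?", "bad.png"))
    print(ecso_respond(toy_mllm, "What is shown here?", "ok.png"))
```

Because the fallback is triggered only when the first response is judged unsafe, benign queries keep their original multimodal answer, which is consistent with the summary's claim that utility is maintained.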