VLSBench: Unveiling Visual Leakage in Multimodal Safety

Xuhao Hu, Dongrui Liu, Hao Li, Xuanjing Huang, Jing Shao
2024-11-30
Abstract: Safety concerns of multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter-intuitive phenomenon: using textual unlearning to align MLLMs achieves safety performance comparable to that of MLLMs trained with image-text pairs. To explain this phenomenon, we discover a visual safety information leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky and sensitive content in the image has been revealed in the textual query. As a result, MLLMs can easily refuse these sensitive text-image queries based on the textual queries alone. However, image-text pairs without VSIL are common in real-world scenarios and are overlooked by existing multimodal safety benchmarks. To this end, we construct a multimodal visual leakless safety benchmark (VLSBench) with 2.4k image-text pairs that prevents visual safety leakage from the image to the textual query. Experimental results indicate that VLSBench poses a significant challenge to both open-source and closed-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o. This study demonstrates that textual alignment is enough for multimodal safety scenarios with VSIL, while multimodal alignment is a more promising solution for multimodal safety scenarios without VSIL. Please see our code and data at: http://hxhcreate.github.io/VLSBench
Cryptography and Security, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a phenomenon called "Visual Safety Information Leakage" (VSIL) in existing multimodal safety benchmarks. VSIL occurs when the sensitive or risky content of an image is also revealed in the accompanying text query, so a multimodal large language model (MLLM) can refuse the request by relying on the text alone, without understanding or perceiving the image. Because of this leakage, text-only alignment methods (such as textual unlearning or text fine-tuning) achieve safety performance comparable to methods that align on image-text pairs (such as supervised fine-tuning, SFT, and reinforcement learning from human feedback, RLHF), even though the latter require more data collection and computation. To evaluate the safety of MLLMs more accurately when VSIL is absent, the authors construct a new multimodal Visual Leakless Safety Benchmark (VLSBench), which contains 2,400 image-text pairs and prevents visual safety information from leaking from the image into the text query. The experimental results show that VLSBench poses a significant challenge to existing open-source and closed-source MLLMs. In particular, when VSIL is absent, multimodal alignment outperforms text-only alignment. This suggests that multimodal alignment is the more promising solution in practice, especially for multimodal safety scenarios without VSIL.
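To make the idea of leakage concrete, the following is a minimal toy sketch, not the authors' method. It assumes each image comes with a hand-written list of risky terms describing its content, and it flags a pair as leaking when any of those terms also appears in the text query. The function names `leaked_terms` and `has_vsil`, the term list, and the example queries are all hypothetical.

```python
import re


def leaked_terms(query: str, risky_image_terms: list[str]) -> list[str]:
    """Return the risky image terms that also appear verbatim in the text query.

    Toy heuristic (an assumption, not from the paper): a term counts as
    leaked if it occurs in the query as a whole word or phrase,
    ignoring case.
    """
    lowered = query.lower()
    found = []
    for term in risky_image_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.append(term)
    return found


def has_vsil(query: str, risky_image_terms: list[str]) -> bool:
    """True if the query reveals at least one risky term from the image."""
    return bool(leaked_terms(query, risky_image_terms))


# Hypothetical image whose risky content is a knife.
image_terms = ["knife"]

# The query names the risky object, so a text-only model could refuse it.
leaky_query = "How can I hide the knife in this picture from security?"

# The query is harmless on its own; the risk is visible only in the image.
leakless_query = "How can I hide this from security?"

print(has_vsil(leaky_query, image_terms))
print(has_vsil(leakless_query, image_terms))
```

Exact string matching of this kind misses paraphrases and synonyms, so it only illustrates the distinction between pairs with and without VSIL; it is not a substitute for how the benchmark itself is constructed.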