Abstract: Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing LLMs' safety alignment through visual modality features to launch attacks. To address this issue, we enhance the existing VLMs' visual modality safety alignment by adding safety modules, including a safety projector, safety tokens, and a safety head, through a two-stage training process, effectively improving the model's defense against risky images. For example, building upon the LLaVA-v1.5 model, we achieve a safety score of 8.26, surpassing GPT-4V on the Red Teaming Visual Language Models (RTVLM) benchmark. Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance. Moreover, our alignment strategy also uncovers some possible risky content within commonly used open-source multimodal datasets. Our code will be open sourced after the anonymous review.

What problem does this paper attempt to address?

This paper addresses the insufficient safety alignment of the visual modality in Vision Language Models (VLMs). Although existing large language models (LLMs) have undergone safety alignment, multimodal models built on them remain vulnerable to attacks during deployment, in particular attacks that bypass the LLM's safety alignment through visual modality features. Specifically, the paper points out that when a vision-language model encounters an image containing pornographic, discriminatory, or similar content, it may generate inappropriate or harmful text descriptions. The paper therefore proposes a way to strengthen the safety alignment of the visual modality in existing VLMs and improve their defense against risky images.

To this end, the authors propose a method named SafeVLM. It adds three safety modules (a safety projector, safety tokens, and a safety head) and trains them in a two-stage process. These modules extract potential risk features from the image and let them interact with text features, improving the model's safety without significantly affecting its general performance. The method also provides a flexible risk-control strategy that can be adjusted by risk type and level, which makes it adaptable to the requirements of different countries and regions.

In their experiments, the authors report that SafeVLM outperforms other existing models on multiple benchmarks, especially on high-risk content (such as pornographic or politically sensitive content), where it shows stronger safety performance. According to the paper, SafeVLM thus improves safety while largely preserving the model's general performance.
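The idea of a risk-control strategy that is adjustable by risk type and level can be illustrated with a small sketch. This is not the paper's implementation: the category names, the level scale, the `RiskPolicy` class, and the assumption that a safety head yields one level per category are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical ordered risk levels; the paper's actual taxonomy may differ.
LEVELS = {"safe": 0, "low": 1, "medium": 2, "high": 3}


@dataclass
class RiskPolicy:
    """Per-deployment thresholds: the lowest level at which a category is blocked."""
    block_at: dict = field(default_factory=dict)
    default_block_at: str = "high"

    def decide(self, predictions: dict) -> str:
        """Return "refuse" if any predicted level reaches its threshold, else "answer".

        `predictions` maps a risk category to a level name and stands in for
        the output of a safety head (an assumed interface, not the paper's).
        """
        for category, level in predictions.items():
            threshold = self.block_at.get(category, self.default_block_at)
            if LEVELS[level] >= LEVELS[threshold]:
                return "refuse"
        return "answer"


# Two illustrative deployments with different tolerances.
strict = RiskPolicy(block_at={"pornographic": "low", "political": "medium"})
lenient = RiskPolicy(block_at={"pornographic": "high"})

# Assumed per-category prediction for a single image.
image_prediction = {"pornographic": "safe", "political": "medium"}
print(strict.decide(image_prediction))
print(lenient.decide(image_prediction))
```

Keeping the thresholds in a separate policy object means the same predictions can be handled differently per deployment without retraining anything, which is one plausible reading of the controllability the summary describes.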

Safety Alignment for Vision Language Models

Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models

Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

Visual Adversarial Examples Jailbreak Aligned Large Language Models

VLSBench: Unveiling Visual Leakage in Multimodal Safety

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

Red Teaming Visual Language Models

Uncovering Safety Risks of Large Language Models through Concept Activation Vector

Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance

Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models