Safety Alignment for Vision Language Models

Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng
2024-05-22
Abstract: Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to an LLM can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing LLMs' safety alignment through visual modality features to launch attacks. To address this issue, we enhance the existing VLMs' visual modality safety alignment by adding safety modules, including a safety projector, safety tokens, and a safety head, through a two-stage training process, effectively improving the model's defense against risky images. For example, building upon the LLaVA-v1.5 model, we achieve a safety score of 8.26, surpassing GPT-4V on the Red Teaming Visual Language Models (RTVLM) benchmark. Our method boasts ease of use, high flexibility, and strong controllability, and it enhances safety while having minimal impact on the model's general performance. Moreover, our alignment strategy also uncovers some possible risky content within commonly used open-source multimodal datasets. Our code will be open sourced after the anonymous review.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the insufficient safety alignment of the visual modality in Vision Language Models (VLMs). Although existing large language models (LLMs) have undergone safety alignment, multimodal models built on them remain vulnerable to attack during deployment, because visual-modality features can bypass the LLM's safety alignment. Specifically, when a VLM encounters an image containing pornographic, discriminatory, or similarly risky content, it may generate inappropriate or harmful text.

To address this, the authors propose a method named SafeVLM. It strengthens the visual-modality safety alignment of an existing VLM by adding three safety modules (a safety projector, safety tokens, and a safety head) and training them in a two-stage process. These modules extract potential risk features from the image and let them interact with text features, improving safety without significantly affecting the model's general performance. The method also provides a flexible risk-control strategy that can be adjusted by risk type and level, which suits the differing requirements of different countries and regions.

In their experiments, the authors report that SafeVLM outperforms other existing models on multiple benchmarks, particularly on high-risk content such as pornographic or politically sensitive material, while maintaining the model's general performance.
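
The adjustable risk-control strategy can be illustrated with a minimal sketch. This is not the authors' implementation: the `RiskPolicy` class, the `respond` function, the risk-type names, and the integer risk levels (0 = none, 1 = low, 2 = high) are all hypothetical, and the dictionary of predicted levels merely stands in for whatever the safety head actually outputs. The sketch only shows how per-type thresholds could let one deployment block content that another allows.

```python
from dataclasses import dataclass, field


@dataclass
class RiskPolicy:
    """Hypothetical per-deployment policy.

    block_at_level maps a risk type to the minimum predicted level
    (0 = none, 1 = low, 2 = high) that triggers a refusal. Risk types
    missing from the mapping are never blocked.
    """
    block_at_level: dict = field(default_factory=dict)

    def flagged(self, predicted_levels):
        """Return the sorted risk types whose predicted level meets the threshold."""
        return sorted(
            risk
            for risk, level in predicted_levels.items()
            if risk in self.block_at_level and level >= self.block_at_level[risk]
        )


def respond(predicted_levels, policy, generate):
    """Refuse if the policy flags any risk type; otherwise call the generator."""
    hits = policy.flagged(predicted_levels)
    if hits:
        return "Refused: image flagged for " + ", ".join(hits)
    return generate()


# Two assumed deployments with different tolerances.
strict = RiskPolicy({"pornographic": 1, "political": 1, "discriminatory": 1})
lenient = RiskPolicy({"pornographic": 2, "discriminatory": 2})

# Stand-in for a safety-head prediction on one image (assumed format).
head_output = {"pornographic": 0, "political": 1, "discriminatory": 0}

print(respond(head_output, strict, lambda: "A caption."))   # refused under the strict policy
print(respond(head_output, lenient, lambda: "A caption."))  # answered under the lenient policy
```

Keeping the thresholds in a plain mapping, separate from the model, is one way to change the policy per region without retraining. Whether SafeVLM exposes its controls this way is an assumption.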