Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba
2024-10-12
Abstract: The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module compared to its LLM backbone. We investigate this phenomenon, dubbed as "safety alignment degradation" in this paper, and show that the challenge arises from the representation gap that emerges when introducing vision modality to VLMs. In particular, we show that the representations of multi-modal inputs shift away from that of text-only inputs which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference time representation intervention method for recovering the safety alignment ability that is inherent in the LLM backbone of VLMs, while simultaneously preserving the functional capabilities of VLMs. The empirical results show that our framework significantly recovers the alignment ability that is inherited from the LLM backbone with minimal impact on the fluency and linguistic capabilities of pre-trained VLMs even without additional training. Specifically, the unsafe rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as 3.15% with only inference-time intervention. WARNING: This paper contains examples of toxic or harmful language.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the decline in safety alignment that vision-language models (VLMs) exhibit on multi-modal inputs. Once a vision module is integrated, the VLM's alignment ability drops relative to that of its LLM backbone, a phenomenon the paper calls "safety alignment degradation". The paper investigates the cause and proposes a mitigation: adjusting the representations of multi-modal inputs so that the model recovers the safety alignment inherent in its LLM backbone while keeping the functional performance of the VLM.

### Background and Problem Description

VLMs process information from both visual and textual modalities, which shows great potential across many application areas. However, introducing the vision module degrades the VLM's overall alignment ability, especially on safety-related queries. For example, even a blank image may disrupt the VLM's safety alignment and trigger harmful responses.

### Phenomenon Analysis

The paper attributes safety alignment degradation mainly to the gap between the representations of multi-modal inputs and those of text-only inputs. Specifically, the representations of multi-modal inputs shift away from those of text-only inputs, and the latter represent the distribution the LLM backbone is optimized for. Safety alignment that was developed in the textual embedding space does not transfer across this gap, so alignment ability declines.

### Solution

To alleviate safety alignment degradation, the paper proposes Cross-Modality Representation Manipulation (CMRM). CMRM is an inference-time intervention: it adjusts the model's hidden states so that the representations of multi-modal inputs are pulled back toward the distribution the LLM backbone is optimized for, which restores the model's safety alignment ability.

### Experimental Results

The experiments show that CMRM significantly reduces the unsafe rate of VLMs on multi-modal inputs. For example, the unsafe rate of LLaVA-7B on multi-modal inputs drops from 61.53% to as low as 3.15% without additional training. In addition, applying CMRM does not significantly affect the model's general performance and even improves it in some cases.

### Main Contributions

1. **Phenomenon Analysis**: Analyzes the safety alignment degradation of VLMs from the perspective of model representations. Experiments show that simply concatenating embeddings from different modalities causes a representation shift, which suppresses the inherent alignment ability of the LLM backbone.
2. **Method Proposal**: Introduces CMRM, a representation engineering method. By adjusting the representations of multi-modal inputs, it pulls them back toward the distribution optimized by the LLM backbone and restores the model's safety alignment ability.
3. **Experimental Verification**: Verifies the effectiveness of CMRM experimentally. It significantly restores the model's safety alignment ability without sacrificing the general performance of the VLM.

In conclusion, through an in-depth analysis of safety alignment degradation in VLMs, the paper proposes an effective mitigation and offers a starting point for future research on VLM safety alignment.
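
The representation shift described under Solution can be illustrated with a toy sketch. This is not the authors' implementation: the summary only says that hidden states are adjusted at inference time to pull multi-modal representations back toward the text-only distribution. The mean-difference direction, the strength parameter `alpha`, and the helper names `mean_vector`, `shift_direction`, and `apply_shift` are all assumptions made for illustration, and the hidden states are plain Python lists rather than real model activations.

```python
from statistics import fmean


def mean_vector(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def shift_direction(text_only_states, multimodal_states):
    """Assumed direction: mean text-only state minus mean multi-modal state."""
    text_mean = mean_vector(text_only_states)
    multimodal_mean = mean_vector(multimodal_states)
    return [t - m for t, m in zip(text_mean, multimodal_mean)]


def apply_shift(hidden_state, direction, alpha=1.0):
    """Move one hidden state along the direction; alpha is a hypothetical strength knob."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]


# Toy 2-dimensional "hidden states" (made-up numbers, not model outputs).
text_only = [[1.0, 0.0], [3.0, 2.0]]
multimodal = [[5.0, 4.0], [7.0, 6.0]]

direction = shift_direction(text_only, multimodal)
shifted = [apply_shift(h, direction, alpha=1.0) for h in multimodal]

# With alpha = 1.0 the shifted states share the text-only mean.
print(direction)
print(mean_vector(shifted), mean_vector(text_only))
```

In this sketch, `alpha = 0.0` leaves the multi-modal states untouched and `alpha = 1.0` aligns their mean with the text-only mean. How strongly to shift, and which layers to intervene on in a real model, are choices this summary does not specify.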