Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models

Yue Xu, Xiuyuan Qi, Zhan Qin, Wenjie Wang
2024-10-17
Abstract: Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to understand multimodal information comprehensively, achieving remarkable performance in many vision-centric tasks. Despite that, recent studies have shown that these models are susceptible to jailbreak attacks, which refer to an exploitative technique where malicious users can break the safety alignment of the target model and generate misleading and harmful answers. This potential threat is caused by both the inherent vulnerabilities of LLM and the larger attack scope introduced by vision input. To enhance the security of MLLMs against jailbreak attacks, researchers have developed various defense techniques. However, these methods either require modifications to the model's internal structure or demand significant computational resources during the inference phase. Multimodal information is a double-edged sword. While it increases the risk of attacks, it also provides additional data that can enhance safeguards. Inspired by this, we propose Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs, utilizing the cross-modal similarity between harmful queries and adversarial images. CIDER is independent of the target MLLMs and requires less computation cost. Extensive experimental results demonstrate the effectiveness and efficiency of CIDER, as well as its transferability to both white-box and black-box MLLMs.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the security of Multimodal Large Language Models (MLLMs) against jailbreak attacks. Although MLLMs handle multiple types of data (such as text and images) well, malicious users can craft inputs that make the model generate misleading or harmful responses, undermining its safety alignment. The threat stems both from the inherent vulnerabilities of Large Language Models (LLMs) and from the larger attack surface that visual input introduces. Existing defense techniques either require modifying the model's internal structure or consume substantial computing resources at inference time. The paper therefore proposes the Cross-modality Information DEtectoR (CIDER), which detects and defends against jailbreak attacks on MLLMs by using cross-modal similarity to identify maliciously perturbed image inputs.

### Main contributions of the paper

1. **Exploiting the double-edged nature of cross-modality information**: The researchers found that optimization-based jailbreak attacks transfer the harmful content of malicious queries into the image modality, which moves the adversarial image closer to the harmful query in semantic space. CIDER uses this property to detect adversarial images: it compares how the semantic distance to the text query changes for clean images versus adversarial images.
2. **Proposing the plug-in jailbreak detector CIDER**: CIDER is a plug-in detector that is independent of the target model and can protect MLLMs from jailbreak attacks with almost no additional computational overhead.
3. **Extensive experimental verification**: The experimental results show that CIDER outperforms the baseline method in detection success rate while significantly reducing computational cost. CIDER also generalizes well across white-box and black-box MLLMs and across different attack methods.

### Method overview

The core idea of CIDER is to detect adversarial images by comparing how the semantic distance between the image and the text query changes before and after the image is denoised. The specific steps are as follows:

- **Embedding calculation**: For a given text-image input pair, calculate the embedding representations \( E_{\text{text}} \) and \( E_{\text{img}(o)} \) of the text and the original image.
- **Denoising**: Perform 350 denoising iterations on the image and calculate the embedding of the denoised image, \( E_{\text{img}(d)} \), after every 50 iterations.
- **Semantic distance calculation**: Calculate the difference between the cosine similarity of the text query with the original image and its cosine similarity with the denoised image: \( \langle E_{\text{text}}, E_{\text{img}(o)} \rangle-\langle E_{\text{text}}, E_{\text{img}(d)} \rangle \).
- **Threshold judgment**: If this difference exceeds the predefined threshold \( \tau \), the image is judged to be adversarial and the model refuses to generate a response; otherwise, the original input is processed as usual.

In this way, CIDER can detect adversarial images efficiently and accurately, thereby enhancing the security of MLLMs.

### Experimental results

The experimental results show that CIDER achieves a relatively high Detection Success Rate (DSR) and a relatively low Attack Success Rate (ASR) on multiple open-source MLLMs. It adds only about 1.02 seconds of overhead per query, far less than the 8 to 9 times overhead of the baseline method Jailguard.

In addition, CIDER reduces performance on normal tasks by about 30%. The drop mainly affects recognition, knowledge, and language generation abilities, and has a smaller impact on OCR, spatial awareness, and math skills.

### Conclusion

This paper provides an effective plug-in solution by proposing CIDER, which can be implemented without significantly increasing...
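
As a rough illustration of the threshold check described in the method overview, the sketch below computes the similarity drop \( \langle E_{\text{text}}, E_{\text{img}(o)} \rangle-\langle E_{\text{text}}, E_{\text{img}(d)} \rangle \) for each denoising checkpoint (350 iterations sampled every 50 gives 7 checkpoints) and compares it with \( \tau \). It is not the authors' implementation. The function names, the toy 2-D embeddings, the value of `TAU`, and the rule of flagging an image as soon as any checkpoint exceeds the threshold are all assumptions made for illustration. The encoder and the denoiser are left out entirely: the functions take precomputed embeddings as plain lists of floats.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_drops(
    e_text: Vector, e_img_orig: Vector, e_img_denoised: Sequence[Vector]
) -> List[float]:
    """Return <E_text, E_img(o)> - <E_text, E_img(d)> for each denoised checkpoint."""
    base = cosine_similarity(e_text, e_img_orig)
    return [base - cosine_similarity(e_text, e_d) for e_d in e_img_denoised]


def is_adversarial(
    e_text: Vector,
    e_img_orig: Vector,
    e_img_denoised: Sequence[Vector],
    tau: float,
) -> bool:
    """Flag the image if the similarity drop exceeds tau.

    Assumption: the image is flagged when ANY checkpoint exceeds tau;
    the summary above does not say how the checkpoints are combined.
    """
    return any(drop > tau for drop in similarity_drops(e_text, e_img_orig, e_img_denoised))


# Toy 2-D embeddings (hypothetical values, not taken from the paper).
TAU = 0.5  # hypothetical threshold
e_text = [1.0, 0.0]

# "Adversarial" case: the image starts aligned with the query and
# drifts away from it once the perturbation is denoised.
adv_img = [1.0, 0.0]
adv_denoised = [[1.0, 0.0], [0.0, 1.0]]

# "Clean" case: denoising leaves the image's relation to the query unchanged.
clean_img = [0.0, 1.0]
clean_denoised = [[0.0, 1.0], [0.0, 1.0]]

print(is_adversarial(e_text, adv_img, adv_denoised, TAU))      # True
print(is_adversarial(e_text, clean_img, clean_denoised, TAU))  # False
```

In a real pipeline the embeddings would come from the MLLM's own or an external image and text encoder, and the denoised variants from a denoising model; neither is specified in this summary, so both are left to the caller here.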