Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
2024-12-16
Abstract: The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of "highly-similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.
Computation and Language, Artificial Intelligence, Cryptography and Security, Computer Vision and Pattern Recognition, Machine Learning
### What problem does this paper attempt to address?

This paper studies how vulnerable vision-language models (VLMs) are to transferable image "jailbreaks". Specifically, the authors ask whether an image optimized to make one VLM produce "harmful but helpful" outputs also jailbreaks other VLMs. An output is "harmful but helpful" when the model complies with a request it should refuse, so that the response both contains harmful content and helps the user carry out a harmful or illegal goal.

#### Research Background

As multimodal capabilities are integrated into frontier AI systems such as Claude 3, GPT-4V, and Gemini Pro, the security of these systems becomes increasingly important. If such models cannot withstand attacks from malicious users, the consequences could include misinformation, phishing, harassment, and possibly, in the future, weapons development and large-scale cybercrime.

#### Main Problems

1. **Are there transferable image jailbreak attacks?**
   - In a large-scale empirical study, the authors evaluated more than 40 open-parameter VLMs, including 18 newly released ones, to determine whether gradient-optimized image jailbreaks transfer across different VLMs.
2. **What factors affect the transfer of image jailbreaks?**
   - The authors examined multiple factors, including whether the attacked and target VLMs share the same vision backbone or language model, and whether the language model underwent instruction-following and/or safety-alignment training.
3. **How can the transferability of image jailbreaks be improved?**
   - The authors found that transfer against a specific target VLM improves significantly when the attack is optimized against a larger ensemble of VLMs that are highly similar to the target.

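
The ensemble attack described above can be sketched in a few lines of Python. This is a toy illustration of the mechanics only, not the authors' implementation: each "VLM" is a hypothetical logistic model over a handful of pixels, the objective (mean negative log-likelihood of a target response, averaged over attacked models and prompts) is an assumed form of a gradient-based universal attack, and all function names are invented for this example. Because the toy models are random and unrelated, the held-out loss it prints says nothing about the transfer results reported in the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(z):
    # -log(sigmoid(z)), computed stably.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def make_toy_model(dim, n_prompts, rng):
    # Hypothetical stand-in for a frozen VLM: one weight per pixel
    # plus one bias per prompt. Not a real vision-language model.
    weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    biases = [rng.gauss(-2.0, 0.5) for _ in range(n_prompts)]
    return weights, biases


def ensemble_loss_and_grad(image, models):
    # Mean NLL of the target response over every (model, prompt) pair,
    # and the gradient of that mean with respect to the image pixels.
    grad = [0.0] * len(image)
    total, count = 0.0, 0
    for weights, biases in models:
        score = sum(w * x for w, x in zip(weights, image))
        for b in biases:
            z = score + b
            total += nll(z)
            coeff = -sigmoid(-z)  # derivative of nll with respect to z
            for i, w in enumerate(weights):
                grad[i] += coeff * w
            count += 1
    return total / count, [g / count for g in grad]


def optimize_image(models, image, steps=300, lr=0.05):
    # Projected gradient descent: step along the negative gradient,
    # then clip every pixel back into the valid range [0, 1].
    image = list(image)
    for _ in range(steps):
        _, grad = ensemble_loss_and_grad(image, models)
        image = [min(1.0, max(0.0, x - lr * g)) for x, g in zip(image, grad)]
    return image


rng = random.Random(0)
dim, n_prompts = 8, 4
attacked = [make_toy_model(dim, n_prompts, rng) for _ in range(3)]
held_out = make_toy_model(dim, n_prompts, rng)

start = [rng.random() for _ in range(dim)]
jailbreak = optimize_image(attacked, start)

loss_before, _ = ensemble_loss_and_grad(start, attacked)
loss_after, _ = ensemble_loss_and_grad(jailbreak, attacked)
held_out_loss, _ = ensemble_loss_and_grad(jailbreak, [held_out])

print("attacked ensemble loss before:", round(loss_before, 3))
print("attacked ensemble loss after: ", round(loss_after, 3))
print("held-out model loss:          ", round(held_out_loss, 3))
```

Attacking a single model is the special case where `attacked` contains one entry. Measuring transfer then amounts to evaluating the optimized image on models that were excluded from `attacked`, as the last loss computation does.
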
#### Key Findings

- **Universality without transferability**: When an image jailbreak is optimized against a single VLM or an ensemble of VLMs, it successfully jailbreaks the attacked VLM(s) but shows little-to-no transfer to VLMs that were not attacked.
- **Partially successful transfer**: Only two settings showed partially successful transfer:
  1. Between identically pretrained and identically initialized VLMs trained on slightly different VLM training data;
  2. Between different training checkpoints of the same VLM.
- **Improving transfer by attacking similar VLMs**: Attacking a larger ensemble of "highly similar" VLMs significantly improves transfer against a specific target VLM.

### Summary

The results suggest that VLMs may be more robust to gradient-based transfer attacks than earlier work would predict, in sharp contrast to existing evidence of universal and transferable text jailbreaks against language models and of transferable adversarial attacks against image classifiers. Although the authors did not find broadly transferable image jailbreaks, they showed that transfer against a specific target improves significantly when the attack is optimized against an ensemble of highly similar VLMs. This points to a direction for future research and suggests that VLMs may handle multimodal inputs in ways that make such attacks harder to transfer.