Abstract: Recent advances in Large Vision-Language Models (LVLMs) have showcased strong reasoning abilities across multiple modalities, achieving significant breakthroughs in various real-world applications. Despite this success, the safety guardrails of LVLMs may not cover the unforeseen domains introduced by the visual modality. Existing studies primarily focus on inducing LVLMs to generate harmful responses via carefully crafted image-based jailbreaks designed to bypass alignment defenses. In this study, we reveal that a safe image can be exploited to achieve the same jailbreak consequence when combined with additional safe images and prompts. This stems from two fundamental properties of LVLMs: universal reasoning capabilities and the safety snowball effect. Building on these insights, we propose Safety Snowball Agent (SSA), a novel agent-based framework that leverages agents' autonomous and tool-using abilities to jailbreak LVLMs. SSA operates through two principal stages: (1) initial response generation, where tools generate or retrieve jailbreak images based on potential harmful intents, and (2) harmful snowballing, where refined subsequent prompts induce progressively harmful outputs. Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreak success rates against the latest LVLMs. Unlike prior works that exploit alignment flaws, SSA leverages the inherent properties of LVLMs, presenting a profound challenge for enforcing safety in generative multimodal systems. Our code is available at https://github.com/gzcch/Safety_Snowball_Agent.

IDEATOR: Jailbreaking VLMs Using VLMs

IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves

White-box Multimodal Jailbreaks Against Large Vision-Language Models

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything

Efficient LLM-Jailbreaking by Introducing Visual Modality

Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection

JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Jailbreaking Attack against Multimodal Large Language Model

PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models

Distract Large Language Models for Automatic Jailbreak Attack

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Jailbreaking? One Step Is Enough!

Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models