Abstract: Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Unlike existing attacks, we propose a more comprehensive strategy that jointly attacks both the text and image modalities to exploit a broader spectrum of vulnerabilities within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Our attack method begins by optimizing an adversarial image prefix from random noise so that it elicits diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is introduced and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, the UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, an outcome known as a jailbreak. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the vulnerability of VLMs and the urgent need for new alignment strategies.

IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves

IDEATOR: Jailbreaking VLMs Using VLMs

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

White-box Multimodal Jailbreaks Against Large Vision-Language Models

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Jailbreaking Attack against Multimodal Large Language Model

Efficient LLM-Jailbreaking by Introducing Visual Modality

Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection

Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Distract Large Language Models for Automatic Jailbreak Attack

PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

Jailbreaking? One Step Is Enough!

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models

Tastle: Distract Large Language Models for Automatic Jailbreak Attack

DeepInception: Hypnotize Large Language Model to Be Jailbreaker