Abstract: Vision-language pre-training (VLP) models excel at interpreting both images and text but remain vulnerable to multimodal adversarial examples (AEs). Advancing the generation of transferable AEs, which succeed across unseen models, is key to developing more robust and practical VLP models. Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process, aiming to improve transferability by expanding the contrast space of image-text features. However, these methods focus solely on diversity around the current AEs, yielding limited gains in transferability. To address this issue, we propose to increase the diversity of AEs by leveraging the intersection regions along the adversarial trajectory during optimization. Specifically, we propose sampling from adversarial evolution triangles, each composed of the clean example, a historical adversarial example, and the current adversarial example, to enhance adversarial diversity. We provide a theoretical analysis to demonstrate the effectiveness of the proposed adversarial evolution triangle. Moreover, we find that redundant inactive dimensions can dominate similarity calculations, distorting feature matching and making AEs model-dependent, which reduces their transferability. Hence, we propose to generate AEs in the semantic image-text feature contrast space, which projects the original feature space into a semantic corpus subspace. The proposed semantic-aligned subspace reduces image feature redundancy, thereby improving adversarial transferability. Extensive experiments across different datasets and models demonstrate that the proposed method effectively improves adversarial transferability and outperforms state-of-the-art adversarial attack methods. The code is released at https://github.com/jiaxiaojunQAQ/SA-AET.
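
To make the sampling idea concrete, here is a minimal sketch of drawing points from a triangle whose vertices are a clean example, a historical adversarial example, and the current adversarial example. It is an illustration under stated assumptions, not the released implementation: the function names `sample_barycentric` and `sample_from_triangle`, the use of uniformly distributed barycentric weights, and the toy four-pixel "images" are all hypothetical choices made for this example. The abstract does not specify the sampling distribution or how the samples enter the attack loss.

```python
import random


def sample_barycentric(rng):
    """Draw weights (a, b, c) uniformly from the 2-simplex.

    The weights are non-negative and sum to 1. Using the gaps between two
    sorted uniform draws gives a uniform distribution over the simplex.
    (Assumption: uniform weights; the paper may sample differently.)
    """
    u, v = sorted((rng.random(), rng.random()))
    return u, v - u, 1.0 - v


def sample_from_triangle(clean, historical, current, rng):
    """Return one point inside the triangle spanned by three flattened images.

    Each argument is a list of pixel values of equal length. The result is
    the convex combination a*clean + b*historical + c*current.
    """
    a, b, c = sample_barycentric(rng)
    return [a * x + b * h + c * y for x, h, y in zip(clean, historical, current)]


# Toy data (hypothetical): a clean image and two adversarial iterates.
rng = random.Random(0)
clean = [0.50, 0.50, 0.50, 0.50]
historical = [0.52, 0.48, 0.51, 0.49]
current = [0.53, 0.47, 0.53, 0.47]

# Draw several diversified samples from the triangle.
samples = [sample_from_triangle(clean, historical, current, rng) for _ in range(5)]
```

Because every sample is a convex combination of the three vertices, it stays within any convex perturbation budget (for example an L-infinity ball around the clean image) that the two adversarial vertices already satisfy, so no extra clipping is needed in this sketch.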

Highly Transferable Diffusion-based Unrestricted Adversarial Attack on Pre-trained Vision-Language Models

Efficient Generation of Targeted and Transferable Adversarial Examples for Vision-Language Models Via Diffusion Models

Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction

VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning

Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation

Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models

Towards Adversarial Attack on Vision-Language Pre-training Models

Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models

Downstream Task-agnostic Transferable Attacks on Language-Image Pre-training Models

A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models

Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models

OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization

An Optimized Transfer Attack Framework Towards Multi-Modal Machine Learning

Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack

Improving Adversarial Transferability by Stable Diffusion

Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models