Abstract: Vision-language pre-training (VLP) models excel at interpreting both images and text but remain vulnerable to multimodal adversarial examples (AEs). Advancing the generation of transferable AEs, which succeed across unseen models, is key to developing more robust and practical VLP models. Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process, aiming to improve transferability by expanding the contrast space of image-text features. However, these methods focus solely on diversity around the current AEs, yielding limited gains in transferability. To address this issue, we propose to increase the diversity of AEs by leveraging the intersection regions along the adversarial trajectory during optimization. Specifically, we propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity. We provide a theoretical analysis to demonstrate the effectiveness of the proposed adversarial evolution triangle. Moreover, we find that redundant inactive dimensions can dominate similarity calculations, distorting feature matching and making AEs model-dependent with reduced transferability. Hence, we propose to generate AEs in the semantic image-text feature contrast space, which can project the original feature space into a semantic corpus subspace. The proposed semantic-aligned subspace can reduce the image feature redundancy, thereby improving adversarial transferability. Extensive experiments across different datasets and models demonstrate that the proposed method can effectively improve adversarial transferability and outperform state-of-the-art adversarial attack methods. The code is released at https://github.com/jiaxiaojunQAQ/SA-AET.

What problem does this paper attempt to address?

The paper aims to improve the transferability of multimodal adversarial examples (AEs) against vision-language pre-training (VLP) models. Existing methods use data augmentation to diversify adversarial examples, but they mainly focus on diversity around the current adversarial examples, which limits the gains in transferability. To overcome this limitation, the authors propose the Adversarial Evolution Triangle (AET), which increases the diversity of adversarial examples and thereby enhances their transferability across different models.

### Main Issues:

1. **Limitations of Existing Methods**: Existing methods for generating highly transferable adversarial examples mainly focus on the diversity around the current adversarial examples. This causes the adversarial examples to overfit and lowers their attack success rate on unseen models.
2. **Improving Transferability of Adversarial Examples**: How to improve the transferability of adversarial examples across different VLP models by increasing their diversity.

### Solution:

1. **Adversarial Evolution Triangle (AET)**: During optimization, the authors sample from the Adversarial Evolution Triangle, which is composed of clean samples, historical adversarial examples, and current adversarial examples, to increase the diversity of adversarial examples.
2. **Semantically Aligned Feature Contrast Space**: To reduce redundant information in image features, the authors generate adversarial examples in a semantically aligned feature contrast space. This space projects the original feature space into a semantic subspace, reducing dependence on the source model and improving the transferability of adversarial examples.
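The triangle sampling described in the Solution can be sketched as follows. This is a minimal illustration, not the authors' implementation: it draws a point uniformly from the triangle spanned by a clean image, a historical adversarial example, and the current adversarial example, using barycentric weights. The function names, the uniform sampling rule, the toy image shape, and the budget `eps = 8 / 255` are all assumptions made for the example; the paper's own sub-triangle sampling strategy is not reproduced here.

```python
import numpy as np


def sample_from_triangle(x_clean, x_hist, x_cur, rng):
    """Draw one point uniformly from the triangle with the three given vertices."""
    # Square-root trick: yields barycentric weights that are uniform over the triangle.
    r1, r2 = rng.random(2)
    s = np.sqrt(r1)
    w_clean, w_hist, w_cur = 1.0 - s, s * (1.0 - r2), s * r2
    return w_clean * x_clean + w_hist * x_hist + w_cur * x_cur


def project_linf(x, x_clean, eps):
    """Clip x into the L-infinity ball of radius eps around x_clean and into [0, 1]."""
    return np.clip(np.clip(x, x_clean - eps, x_clean + eps), 0.0, 1.0)


rng = np.random.default_rng(0)
eps = 8 / 255  # assumed perturbation budget, for illustration only

# Toy stand-ins for a clean image and two adversarial iterates (hypothetical data).
x_clean = rng.random((3, 8, 8))
x_hist = project_linf(x_clean + rng.uniform(-eps, eps, x_clean.shape), x_clean, eps)
x_cur = project_linf(x_clean + rng.uniform(-eps, eps, x_clean.shape), x_clean, eps)

sample = sample_from_triangle(x_clean, x_hist, x_cur, rng)
```

Because the sample is a convex combination of three points that already satisfy the perturbation budget and the pixel range, it satisfies both constraints as well, so no extra projection is needed after sampling in this sketch.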

### Experimental Results:

- The authors conducted extensive experiments on multiple datasets and models, showing that the proposed method significantly improves the transferability of multimodal adversarial examples and outperforms existing state-of-the-art adversarial attack methods.
- When adversarial examples generated on image-text retrieval tasks are applied to other vision-language downstream tasks, the attack performance also improves significantly.

### Main Contributions:

1. Using the intersection of adversarial trajectories in the evolution triangle to enhance the diversity of adversarial examples, thereby improving the transferability of multimodal adversarial examples against VLP models.
2. Exploring how different adversarial evolution sub-triangle sampling strategies affect transferability, and proposing to sample from the evolution sub-triangles close to clean samples and previously generated adversarial examples.
3. Generating deviated adversarial texts in the final adversarial evolution triangle of the optimization trajectory to reduce overfitting to the surrogate model and improve transferability.
4. Generating adversarial examples in the semantic image-text feature contrast space to reduce dependence on the surrogate model, further improving the transferability of adversarial examples.
5. Validating the proposed method through extensive experiments on various network architectures and datasets, demonstrating that it significantly enhances the transferability of multimodal adversarial examples and outperforms existing state-of-the-art multimodal transfer adversarial attack methods.
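The idea of measuring image-text similarity in a semantic subspace, rather than in the full feature space, can be illustrated with a small sketch. The summary does not say how the authors build the semantic corpus subspace, so everything here is an assumption: the subspace is taken to be the top-`k` principal directions of a matrix of corpus text features (obtained with an SVD), and the features themselves are random stand-ins. The names `corpus_basis`, `project`, and `cosine` are hypothetical.

```python
import numpy as np


def corpus_basis(corpus_feats, k):
    """Return a (d, k) orthonormal basis for the top-k directions of the corpus features."""
    # Rows of vt are orthonormal right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(corpus_feats, full_matrices=False)
    return vt[:k].T


def project(feats, basis):
    """Coordinates of the features in the subspace spanned by the basis columns."""
    return feats @ basis


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d, n, k = 64, 200, 16  # feature dim, corpus size, subspace dim (all assumed)

# Random stand-ins for encoder outputs (hypothetical data).
corpus_feats = rng.standard_normal((n, d))
img_feat = rng.standard_normal(d)
txt_feat = rng.standard_normal(d)

basis = corpus_basis(corpus_feats, k)
sim_full = cosine(img_feat, txt_feat)
sim_sub = cosine(project(img_feat, basis), project(txt_feat, basis))
```

In an attack loop, a loss built on `sim_sub` instead of `sim_full` would ignore feature directions outside the chosen subspace, which is one way to read the claim that redundant dimensions should not dominate the similarity calculation.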

Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack
