Abstract: Visual language pre-training (VLP) models have demonstrated significant success across various domains, yet they remain vulnerable to adversarial attacks. Addressing these adversarial vulnerabilities is crucial for enhancing security in multimodal learning. Traditionally, adversarial methods targeting VLP models involve simultaneously perturbing images and text. However, this approach faces notable challenges: first, adversarial perturbations often fail to translate effectively into real-world scenarios; second, direct modifications to the text are conspicuously visible. To overcome these limitations, we propose a novel strategy that exclusively employs image patches for attacks, thus preserving the integrity of the original text. Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations. Moreover, to optimize patch placement and improve the efficacy of our attacks, we utilize the cross-attention mechanism, which encapsulates intermodal interactions by generating attention maps to guide strategic patch placements. Comprehensive experiments conducted in a white-box setting for image-to-text scenarios reveal that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate. Additionally, it demonstrates commendable performance in transfer tasks involving text-to-image configurations.

What problem does this paper attempt to address?

This paper addresses the security of Vision-Language Pre-training (VLP) models under adversarial attack. Existing adversarial methods usually perturb the image and the text simultaneously, which raises two problems: first, such perturbations are difficult to carry over into real-world scenarios; second, direct modifications to the text are easily detected. To address these limitations, the authors propose attacking with image patches only, which leaves the original text intact. To improve the effectiveness of the attack, the method uses the cross-attention mechanism to guide patch placement and a diffusion model to generate more natural adversarial patches.

### Main contributions of the paper:

1. **First exploration**: As far as the authors know, this is the first study specifically dedicated to the security of VLP models under adversarial patch attacks.
2. **Natural adversarial patches**: A framework based on the diffusion model is introduced to generate more natural adversarial patches.
3. **Cross-modal guidance**: The location of the adversarial patches is determined through cross-modal guidance, which improves the effectiveness of the attack.
4. **Experimental verification**: Experiments were carried out on the Flickr30K and MSCOCO datasets. The results show that the method performs well across a variety of VLP models and achieves a 100% attack success rate in the white-box setting.

### Method overview:

- **Threat model**: The attacker's goal is to insert an adversarial patch into the visual input of the VLP model so that the downstream task produces an incorrect output.
- **Diffusion model**: A pre-trained diffusion model is used to generate adversarial patches, which keeps the generated patches close to the distribution of real images.
- **Patch generation**: Adversarial patches are generated through an optimization algorithm, and the cross-attention mechanism is used to determine where the patches are placed.
- **Loss function**: The scoring loss and the total variation loss are combined to optimize the adversarial patches, so that they attack effectively while remaining natural.

### Experimental results:

- **Performance comparison**: On multiple benchmark datasets and VLP models, the attack success rate of this method is significantly higher than that of other existing methods.
- **Naturalness**: The generated adversarial patches are not only effective but also natural-looking and not easily detectable.

Through these contributions, the paper provides new ideas and methods for improving the robustness and security of VLP models.
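The summary names two mechanisms without giving their details: attention-guided patch placement and a loss that combines a scoring term with total variation. The NumPy sketch below illustrates one plausible reading and is not the authors' implementation. It assumes the cross-attention map has already been resized to the image resolution, that the patch is square, and that the scoring loss is available as a single scalar. The names `best_patch_location`, `apply_patch`, `total_variation`, `patch_objective`, `match_score`, and `tv_weight` are all hypothetical.

```python
import numpy as np


def best_patch_location(attn_map, patch_size):
    """Top-left (row, col) of the square window with the largest total attention.

    Assumption: attn_map is a 2-D array already at image resolution.
    """
    h, w = attn_map.shape
    p = patch_size
    # Summed-area table with a zero border, so each window sum is four lookups.
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = attn_map.cumsum(axis=0).cumsum(axis=1)
    window_sums = sat[p:, p:] - sat[:-p, p:] - sat[p:, :-p] + sat[:-p, :-p]
    row, col = np.unravel_index(np.argmax(window_sums), window_sums.shape)
    return int(row), int(col)


def apply_patch(image, patch, row, col):
    """Return a copy of image with the square patch pasted at (row, col)."""
    out = image.copy()
    p = patch.shape[0]
    out[row:row + p, col:col + p] = patch
    return out


def total_variation(patch):
    """Anisotropic total variation: summed absolute differences between
    vertically and horizontally adjacent pixels. Smooth patches score low."""
    dv = np.abs(patch[1:, :] - patch[:-1, :]).sum()
    dh = np.abs(patch[:, 1:] - patch[:, :-1]).sum()
    return float(dv + dh)


def patch_objective(match_score, patch, tv_weight=0.1):
    """Hypothetical combined objective to minimize.

    match_score stands in for the scoring loss (for example an image-text
    matching score the attacker wants to push down); tv_weight is an
    assumed trade-off coefficient, not a value from the paper.
    """
    return match_score + tv_weight * total_variation(patch)


# Toy usage: a 6x6 attention map with one 2x2 hot spot.
attn = np.zeros((6, 6))
attn[3:5, 2:4] = 1.0
row, col = best_patch_location(attn, patch_size=2)

image = np.zeros((6, 6, 3))
patch = np.ones((2, 2, 3))
patched = apply_patch(image, patch, row, col)
```

In a real attack the patch would come from a diffusion model and `match_score` from the target VLP model, with gradients flowing through both; this sketch only shows how the placement and smoothness terms fit together.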

Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models

CAPatch: Physical Adversarial Patch against Image Captioning Systems

Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors

Towards Adversarial Attack on Vision-Language Pre-training Models

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Universal Adversarial Perturbations for Vision-Language Pre-trained Models

Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation

Adversarial Prompt Tuning for Vision-Language Models

Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Highly Transferable Diffusion-based Unrestricted Adversarial Attack on Pre-trained Vision-Language Models

Natural Adversarial Patch Generation Method Based on Latent Diffusion Model

Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models

A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models

VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models

Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Replace-then-Perturb: Targeted Adversarial Attacks With Visual Reasoning for Vision-Language Models

Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning