Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

Jiwei Guan, Tianyu Ding, Longbing Cao, Lei Pan, Chen Wang, Xi Zheng

2024-08-24

Abstract: Vision-language pretraining (VLP) with transformers has demonstrated exceptional performance across numerous multimodal tasks. However, the adversarial robustness of these models has not been thoroughly investigated. Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities, particularly in the context of cross-attention mechanisms. In this paper, we study the adversarial vulnerability of recent VLP transformers and design a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities under white-box settings. JMTFA strategically targets attention relevance scores to disrupt important features within each modality, generating adversarial samples by fusing perturbations and leading to erroneous model predictions. Experimental results indicate that the proposed approach achieves high attack success rates on vision-language understanding and reasoning downstream tasks compared to existing baselines. Notably, our findings reveal that the textual modality significantly influences the complex fusion processes within VLP transformers. Moreover, we observe no apparent relationship between model size and adversarial robustness under our proposed attacks. These insights emphasize a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper investigates how vulnerable vision-language pretraining (VLP) models are to adversarial attacks and proposes a new attack, the Joint Multimodal Transformer Feature Attack (JMTFA). Although VLP models perform well on many cross-modal tasks, their adversarial robustness has not been thoroughly studied, and existing multimodal attacks mostly ignore the interactions between the visual and textual modalities, especially those mediated by cross-attention. JMTFA perturbs both modalities at once in a white-box setting: it targets attention relevance scores to disrupt the important features in each modality, then fuses the perturbations into adversarial samples that cause incorrect predictions. Experiments show high attack success rates on vision-language understanding and reasoning downstream tasks compared with existing baselines. The study also finds that the textual modality significantly influences the fusion process in VLP transformers, and that there is no apparent relationship between model size and adversarial robustness under the proposed attacks. These findings highlight a new dimension of adversarial robustness and underscore the risks of deploying multimodal AI systems.
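To make the idea of perturbing both modalities at once more concrete, here is a minimal sketch on a toy model. It is not JMTFA and does not reproduce the paper's method: the bilinear scorer, the functions `score`, `joint_attack`, and `attack_success_rate`, and all parameter names are hypothetical, and each token's contribution to the logit is used as a simple stand-in for attention relevance scores. The visual side takes one signed-gradient (FGSM-style) step; the textual side swaps the single token that most supports the current prediction. The success-rate helper assumes a common convention (fraction of originally correct samples that become wrong), which may differ from the definition used in the paper.

```python
import numpy as np

def score(v, tokens, emb, W):
    """Toy bilinear scorer (assumption): the sign of the logit is the prediction."""
    return float(v @ W @ emb[np.asarray(tokens)].mean(axis=0))

def joint_attack(v, tokens, emb, W, eps=0.1):
    """Perturb the image feature and one text token to push the logit toward a flip."""
    tokens = np.asarray(tokens)
    push = -np.sign(score(v, tokens, emb, W))  # direction that opposes the prediction

    # Visual side: one signed-gradient step, bounded by eps in the L-infinity norm.
    # For this scorer the gradient of the logit w.r.t. v is W @ mean(token embeddings).
    grad_v = W @ emb[tokens].mean(axis=0)
    v_adv = v + eps * push * np.sign(grad_v)

    # Textual side: score every vocabulary entry by how strongly it pushes toward
    # the flip, given the already-perturbed image feature.
    pull = (emb @ (W.T @ v_adv)) * push
    target = int(np.argmin(pull[tokens]))      # token most supporting the prediction
    adv_tokens = tokens.copy()
    adv_tokens[target] = int(np.argmax(pull))  # replace it with the strongest opponent
    return v_adv, adv_tokens

def attack_success_rate(labels, clean_preds, adv_preds):
    """Fraction of originally correct samples that the attack turns incorrect."""
    labels, clean_preds, adv_preds = map(np.asarray, (labels, clean_preds, adv_preds))
    correct = clean_preds == labels
    if not correct.any():
        return 0.0
    return float((adv_preds[correct] != labels[correct]).mean())

# Usage on random toy data (20-word vocabulary, 3-dim image feature, 4-dim embeddings).
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 4))
W = rng.normal(size=(3, 4))
v = rng.normal(size=3)
v_adv, adv_tokens = joint_attack(v, [1, 5, 7], emb, W, eps=0.1)
print(score(v, [1, 5, 7], emb, W), score(v_adv, adv_tokens, emb, W))
```

In this toy setting the two perturbations are chosen sequentially, with the token swap computed against the already-perturbed image feature, so the text step accounts for the visual step. Whether and how the real attack couples the two modalities is described in the paper itself, not here.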

Protego: Detecting Adversarial Examples for Vision Transformers Via Intrinsic Capabilities

Towards Adversarial Attack on Vision-Language Pre-training Models

Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction

A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models

On Evaluating Adversarial Robustness of Large Vision-Language Models

Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models

Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models

Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models

Universal Adversarial Perturbations for Vision-Language Pre-trained Models

Partially Recentralization Softmax Loss for Vision-Language Models Robustness

On the Adversarial Robustness of Vision Transformers

White-box Multimodal Jailbreaks Against Large Vision-Language Models

SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation

Adversarial Prompt Tuning for Vision-Language Models