Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

Jiwei Guan, Tianyu Ding, Longbing Cao, Lei Pan, Chen Wang, Xi Zheng
2024-08-24
Abstract: Vision-language pretraining (VLP) with transformers has demonstrated exceptional performance across numerous multimodal tasks. However, the adversarial robustness of these models has not been thoroughly investigated. Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities, particularly in the context of cross-attention mechanisms. In this paper, we study the adversarial vulnerability of recent VLP transformers and design a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities under white-box settings. JMTFA strategically targets attention relevance scores to disrupt important features within each modality, generating adversarial samples by fusing perturbations and leading to erroneous model predictions. Experimental results indicate that the proposed approach achieves high attack success rates on vision-language understanding and reasoning downstream tasks compared to existing baselines. Notably, our findings reveal that the textual modality significantly influences the complex fusion processes within VLP transformers. Moreover, we observe no apparent relationship between model size and adversarial robustness under our proposed attacks. These insights emphasize a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper examines how vulnerable Vision-Language Pretraining (VLP) models are to adversarial attacks and proposes a new attack, the Joint Multimodal Transformer Feature Attack (JMTFA). Although VLP models perform well on many cross-modal tasks, their adversarial robustness has not been thoroughly studied, and existing multimodal attack methods mostly ignore the interactions between the visual and textual modalities, especially those that arise in cross-attention mechanisms. JMTFA operates in a white-box setting and perturbs the visual and textual inputs at the same time. It targets attention relevance scores to disrupt the important features in each modality, then fuses the perturbations into adversarial samples that cause incorrect model predictions. Experiments show that the method achieves higher attack success rates than existing baselines on downstream vision-language understanding and reasoning tasks. The study also finds that the textual modality significantly influences the fusion process in VLP transformers, and that there is no apparent relationship between model size and adversarial robustness under the proposed attack. These findings highlight a new dimension of adversarial robustness and underscore the potential risks in the reliable deployment of multimodal AI systems.
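To make the idea of perturbing both modalities at once more concrete, here is a minimal sketch of a single white-box, sign-gradient step applied jointly to a visual and a textual feature vector. It is not JMTFA: it does not use attention relevance scores, and everything in it is an assumption made for illustration. The toy bilinear scorer (`logits`) merely stands in for cross-modal fusion, the text is perturbed in a continuous embedding space rather than as discrete tokens, and the names `joint_sign_step`, `eps_v`, and `eps_t` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def logits(A, v, t):
    # Toy bilinear fusion (an assumption, not a VLP transformer):
    # score_k = v^T A_k t, with A of shape (K, Dv, Dt).
    return np.einsum("kij,i,j->k", A, v, t)

def loss_and_grads(A, v, t, label):
    # Cross-entropy loss and its gradients w.r.t. both modalities.
    p = softmax(logits(A, v, t))
    loss = -np.log(p[label])
    d = p.copy()
    d[label] -= 1.0  # d(loss)/d(logits) for softmax cross-entropy
    grad_v = np.einsum("k,kij,j->i", d, A, t)
    grad_t = np.einsum("k,kij,i->j", d, A, v)
    return loss, grad_v, grad_t

def joint_sign_step(A, v, t, label, eps_v, eps_t):
    # One sign-gradient ascent step on the loss, applied to the visual
    # and textual features simultaneously (hypothetical helper).
    _, grad_v, grad_t = loss_and_grads(A, v, t, label)
    v_adv = v + eps_v * np.sign(grad_v)
    t_adv = t + eps_t * np.sign(grad_t)
    return v_adv, t_adv

# Small random instance: 3 classes, 8-dimensional features per modality.
rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((3, 8, 8))
v = rng.standard_normal(8)   # stand-in for visual features
t = rng.standard_normal(8)   # stand-in for text embeddings
label = 0

clean_loss, _, _ = loss_and_grads(A, v, t, label)
v_adv, t_adv = joint_sign_step(A, v, t, label, eps_v=0.01, eps_t=0.01)
adv_loss, _, _ = loss_and_grads(A, v_adv, t_adv, label)

print("clean loss:", clean_loss)
print("adversarial loss:", adv_loss)
```

Because the score is bilinear, the gradient with respect to each modality depends on the other modality's features, which is a very rough analogue of the cross-modal interaction the paper emphasizes. A real attack on a VLP transformer would additionally have to handle discrete text tokens and image-domain perturbation budgets.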