Abstract: Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models, a common and practical real-world scenario, remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to three shortcomings: (1) poor generalization when perturbing video features, (2) a focus on sparse key frames only, and (3) a failure to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we use an image-based multimodal model (IMM) as a surrogate model to craft adversarial video samples. Multimodal interactions and temporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. In addition, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method generates adversarial examples that transfer strongly across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as the surrogate model) achieve competitive performance, with average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA for VideoQA tasks. Our code will be released upon acceptance.
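The abstract does not specify how perturbation propagation works, so the following is only a minimal sketch of one plausible reading: perturbations computed on a few key frames are copied to every other frame, so that a target model with an unknown frame sampling strategy still receives perturbed frames. The function name `propagate_perturbation`, the nearest-key-frame rule, and the L-infinity budget `epsilon` are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List

Frame = List[float]  # one frame, flattened to pixel values in [0, 1]


def propagate_perturbation(
    video: List[Frame],
    key_deltas: Dict[int, Frame],
    epsilon: float,
) -> List[Frame]:
    """Give every frame the perturbation of its nearest key frame.

    Hypothetical rule for illustration only; the paper's actual
    propagation technique is not described in the abstract.
    """
    if not key_deltas:
        raise ValueError("need at least one key-frame perturbation")
    key_indices = sorted(key_deltas)
    adversarial: List[Frame] = []
    for t, frame in enumerate(video):
        # Pick the key frame closest in time (ties go to the earlier one).
        nearest = min(key_indices, key=lambda k: abs(k - t))
        delta = key_deltas[nearest]
        adv_frame: Frame = []
        for pixel, d in zip(frame, delta):
            d = max(-epsilon, min(epsilon, d))  # assumed L-infinity budget
            adv_frame.append(max(0.0, min(1.0, pixel + d)))  # valid pixel range
        adversarial.append(adv_frame)
    return adversarial


# Toy usage: a 4-frame video with 2 pixels per frame, key frames 0 and 3.
video = [[0.5, 0.5] for _ in range(4)]
key_deltas = {0: [0.1, -0.1], 3: [-0.2, 0.2]}
adv_video = propagate_perturbation(video, key_deltas, epsilon=0.1)
```

In this toy case, frames 1 and 2 are not key frames but are still perturbed, which is the property such a scheme would need if the target model samples frames the attacker did not anticipate.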

Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning

Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction

Transferable Multimodal Attack on Vision-Language Pre-training Models

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models

SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation

Towards Adversarial Attack on Vision-Language Pre-training Models

Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

Highly Transferable Diffusion-based Unrestricted Adversarial Attack on Pre-trained Vision-Language Models

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models

OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization

A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models

VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models

Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models

Downstream Task-agnostic Transferable Attacks on Language-Image Pre-training Models

Mutual-modality Adversarial Attack with Semantic Perturbation

Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models