Abstract: Visual Question Answering (VQA) is a fundamental task in computer vision and natural language processing. Although the "pre-training & fine-tuning" learning paradigm significantly improves VQA performance, the adversarial robustness of this paradigm has not been explored. In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack target VQA models. Correspondingly, we propose a novel VQAttack model, which iteratively generates both image and text perturbations with two designed modules: the large language model (LLM)-enhanced image attack module and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes a latent representation-based loss to generate feature-level image perturbations. It then incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module is triggered at a specific iteration and updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack in the transferable attack setting, compared with state-of-the-art baselines. This work reveals a significant blind spot in the "pre-training & fine-tuning" paradigm on VQA tasks. Source code will be released.

What problem does this paper attempt to address?

The main problem this paper addresses is the adversarial robustness of Visual Question Answering (VQA) models under the "pre-training & fine-tuning" paradigm. Specifically, the authors study how to use a pre-trained multimodal source model to generate adversarial image-text pairs and transfer them to attack target VQA models. The work exposes potential security problems of VQA under adversarial attack in this paradigm, a blind spot that previous studies have rarely examined.

### Research Background

Visual Question Answering (VQA) is a fundamental task that combines computer vision and natural language processing. Its goal is to extract key information from an image in order to answer a question posed as text. Although the "pre-training & fine-tuning" learning paradigm has significantly improved VQA performance, the adversarial robustness of this paradigm has not been fully explored. Current research mainly investigates the robustness of end-to-end trained VQA models by developing effective attack methods, but those models usually perform worse than models built under the "pre-training & fine-tuning" paradigm.

### Research Challenges

1. **Transferability across models**: Transferring adversarial attacks across different models is difficult. The pre-trained source model and the victim target VQA model are usually trained on different tasks and datasets, and their structures may also diverge because of changes made during fine-tuning. Although transferability has been widely verified for image models, it has not been comprehensively explored for pre-trained models.
2. **Cross-modal joint attack**: VQA is a multimodal problem, so both the image and the text need to be perturbed to improve the attack. Because image values are continuous and text values are discrete, optimizing the two perturbations jointly is a complex task.

### Solutions

To address these challenges, the authors propose a new method named VQAttack to explore adversarial transferability between the pre-trained source model and the victim target VQA model. VQAttack generates image and text perturbations based on the pre-trained source model and adopts a novel multi-step attack framework. The method includes two key modules:

- **Large Language Model (LLM)-enhanced image attack module**: Generates image perturbations by minimizing the latent feature similarity between clean and perturbed inputs, and introduces a new masked answer anti-recovery loss that uses an LLM to further enhance the image perturbations.
- **Cross-modal joint attack module**: Updates the image and text perturbations sequentially once it is triggered at a specific iteration. The text perturbation update is based on gradients in the word embedding space and on synonym replacement.

### Experimental Results

The experiments were carried out on two VQA datasets (VQAv2 and TextVQA), involving five pre-trained models. The results show that VQAttack significantly outperforms existing baseline methods in the transferable attack setting, revealing a weakness in the adversarial robustness of the "pre-training & fine-tuning" paradigm on VQA tasks.

### Main Contributions

- It is the first study of the adversarial robustness of VQA under the "pre-training & fine-tuning" paradigm, and it explores the resulting security problems.
- It proposes VQAttack, a new method for generating adversarial image-text pairs. The method includes two new modules, uses an LLM to generate masked text, and performs iterative joint attacks on the image and text modalities.
- The experimental results verify the effectiveness of VQAttack in the transferable attack setting.
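The feature-level image attack summarized under Solutions (minimizing the latent feature similarity between clean and perturbed inputs) can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the linear `encode` function stands in for a pre-trained multimodal encoder, and the names `feature_attack`, `eps`, `step`, `iters`, and `seed`, the signed-gradient update, and the random start are hypothetical choices. The LLM-based masked answer anti-recovery loss and the text perturbation step are not shown.

```python
import math
import random


def encode(W, x):
    """Toy linear 'encoder' (assumption): returns W @ x as a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def cosine_grad_wrt_input(W, x_adv, clean_feat):
    """Analytic gradient of cosine(encode(W, x_adv), clean_feat) w.r.t. x_adv."""
    u = encode(W, x_adv)
    v = clean_feat
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    # Gradient of the cosine with respect to the perturbed features u.
    g_u = [b / (nu * nv) - dot * a / (nu ** 3 * nv) for a, b in zip(u, v)]
    # Chain rule through the linear encoder: W^T g_u.
    return [sum(W[i][j] * g_u[i] for i in range(len(W)))
            for j in range(len(x_adv))]


def _sign(g):
    """Sign of g as -1, 0, or 1."""
    return (g > 0) - (g < 0)


def _apply(x, delta):
    """Add the perturbation and keep pixel values in [0, 1]."""
    return [min(1.0, max(0.0, xi + d)) for xi, d in zip(x, delta)]


def feature_attack(W, x, eps=0.1, step=0.02, iters=20, seed=0):
    """Lower the feature similarity to the clean input under an L-inf budget.

    A random start is used because the gradient of the cosine is zero
    at the clean input itself, where the similarity is at its maximum.
    """
    rng = random.Random(seed)
    clean_feat = encode(W, x)
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(iters):
        grad = cosine_grad_wrt_input(W, _apply(x, delta), clean_feat)
        # Signed descent step on the similarity, then projection
        # back into the L-inf ball of radius eps.
        delta = [max(-eps, min(eps, d - step * _sign(g)))
                 for d, g in zip(delta, grad)]
    return _apply(x, delta)


if __name__ == "__main__":
    W = [[1.0, 0.0, -1.0, 0.5],
         [0.5, 1.0, 0.0, -1.0],
         [-1.0, 0.5, 1.0, 0.0]]
    x = [0.2, 0.4, 0.6, 0.8]  # toy "image" with pixel values in [0, 1]
    x_adv = feature_attack(W, x)
    print("similarity after attack:", cosine(encode(W, x_adv), encode(W, x)))
```

In a transfer setting, the perturbed input would be crafted against the source encoder only and then passed to a separate target model. The sketch stops at crafting the perturbation and does not model the target.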

VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models
