VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models

Ziyi Yin, Muchao Ye, Tianrong Zhang, Jiaqi Wang, Han Liu, Jinghui Chen, Ting Wang, Fenglong Ma
2024-02-17
Abstract: Visual Question Answering (VQA) is a fundamental task in the computer vision and natural language processing fields. Although the "pre-training & fine-tuning" learning paradigm significantly improves VQA performance, the adversarial robustness of this paradigm has not been explored. In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack target VQA models. Correspondingly, we propose a novel VQAttack model, which iteratively generates both image and text perturbations with two designed modules: the large language model (LLM)-enhanced image attack module and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes a latent representation-based loss to generate feature-level image perturbations. It then incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module is triggered at a specific iteration and updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack in the transferable attack setting, compared with state-of-the-art baselines. This work reveals a significant blind spot in the "pre-training & fine-tuning" paradigm on VQA tasks. Source code will be released.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is the adversarial robustness of the Visual Question Answering (VQA) task under the "pre-training & fine-tuning" paradigm. Specifically, the authors study how to use a pre-trained multimodal source model to generate adversarial image-text pairs and transfer them to attack a target VQA model. The work exposes potential security problems of VQA under adversarial attack in this paradigm, a blind spot that previous studies have rarely examined.

### Research Background

Visual Question Answering (VQA) is a fundamental task that combines computer vision and natural language processing, aiming to extract key information from images in order to answer questions posed in text. Although the "pre-training & fine-tuning" learning paradigm has significantly improved VQA performance, adversarial robustness under this paradigm has not been fully explored. Current research mainly investigates the robustness of end-to-end trained VQA models by developing effective attack methods, but these models usually perform worse than models built under the "pre-training & fine-tuning" paradigm.

### Research Challenges

1. **Transferability across models**: The pre-trained source model and the victim target VQA model are usually trained on different tasks and datasets, and fine-tuning may introduce structural differences between them. Although transferability has been widely verified for image models, it has not been comprehensively explored for pre-trained models.
2. **Cross-modal joint attack**: VQA is a multimodal problem, so both the image and the text must be perturbed to improve the attack effect. Because image values are continuous and text values are discrete, optimizing perturbations for the two modalities at the same time is a complex task.

### Solutions

To address these challenges, the authors propose a new method named VQAttack to explore adversarial transferability between the pre-trained source model and the victim target VQA model. VQAttack generates image and text perturbations on the pre-trained source model and adopts a novel multi-step attack framework. The method includes two key modules:

- **Large language model (LLM)-enhanced image attack module**: Generates image perturbations by minimizing the latent feature similarity between clean and perturbed inputs, and introduces a new masked answer anti-recovery loss that uses an LLM to further enhance the image perturbations.
- **Cross-modal joint attack module**: Updates the image and text perturbations sequentially when a specific iteration condition is met. The text perturbation update is based on gradients in the word embedding space and on synonym replacement.

### Experimental Results

The experiments were carried out on two VQA datasets (VQAv2 and TextVQA) and involved five pre-trained models. The results show that VQAttack significantly outperforms existing baseline methods in the transferable attack setting, revealing the weak adversarial robustness of the "pre-training & fine-tuning" paradigm on the VQA task.

### Main Contributions

- This is the first study of the adversarial robustness of the VQA task under the "pre-training & fine-tuning" paradigm and of the potential security problems it raises.
- It proposes VQAttack, a new method for generating adversarial image-text pairs that includes two innovative modules, uses an LLM to generate masked text, and performs iterative joint attacks on the image and text modalities.
- The experimental results verify the effectiveness of VQAttack in the transferable attack setting.
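
The first step of the image attack module, minimizing the latent feature similarity between clean and perturbed inputs, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a stand-in linear "encoder" `W` in place of a pre-trained multimodal model, uses cosine similarity as the similarity measure, and applies a standard PGD-style signed-gradient update under an L-infinity budget. The function names, `eps`, `alpha`, and `steps` are all hypothetical choices made for illustration, and the masked answer anti-recovery loss and the text attack are omitted.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def latent_feature_attack(x, W, eps=8 / 255, alpha=2 / 255, steps=20, seed=0):
    """PGD-style search for delta with max|delta| <= eps that lowers the
    cosine similarity between W @ x and W @ (x + delta).

    W is a toy linear encoder (an assumption), so the gradient is analytic.
    """
    rng = np.random.default_rng(seed)
    z_clean = W @ x
    # Random start: at delta = 0 the similarity is 1 and its gradient is zero.
    delta = rng.uniform(-eps, eps, size=x.shape)
    delta = np.clip(x + delta, 0.0, 1.0) - x
    for _ in range(steps):
        z_adv = W @ (x + delta)
        na, nb = np.linalg.norm(z_clean), np.linalg.norm(z_adv)
        cos = z_clean @ z_adv / (na * nb)
        # Gradient of the cosine w.r.t. z_adv, then chain rule through W.
        grad_z = z_clean / (na * nb) - cos * z_adv / nb**2
        grad_x = W.T @ grad_z
        # Signed descent step, then project onto the eps-ball and [0, 1].
        delta = np.clip(delta - alpha * np.sign(grad_x), -eps, eps)
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return delta


# Usage with random stand-ins for an image and an encoder.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=64)   # flattened toy "image" in [0, 1]
W = rng.normal(size=(16, 64))        # toy "encoder" weights
delta = latent_feature_attack(x, W)
sim = cosine(W @ x, W @ (x + delta))
print(f"feature similarity after attack: {sim:.4f}")
```

With a real pre-trained model the analytic gradient would be replaced by automatic differentiation, and the perturbation would be computed on the source model before being transferred to a separately fine-tuned target model.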