Contrastive Fusion Representation: Mitigating Adversarial Attacks on VQA Models

Jialing He, Zhen Qin, Hangcheng Liu, Shangwei Guo, Biwen Chen, Ning Wang, Tao Xiang
DOI: https://doi.org/10.1109/ICME55011.2023.00068
2023-01-01
Abstract: Visual Question Answering (VQA) is the vision-language task of answering text-based questions about an image, and it has been advanced by the remarkable success of multimodal deep networks. Like unimodal networks, multimodal VQA models are vulnerable to adversarial examples, which pose severe threats to the corresponding applications. Although several adversarial training methods have been proposed, most of them focus on improving the generalization ability of VQA models on clean samples rather than on mitigating adversarial attacks. In this paper, we systematically analyze the core structure of multimodal VQA networks and propose a novel adversarial training algorithm to mitigate adversarial attacks on VQA models. Specifically, our key component is a regularization term built on our carefully designed Contrastive Fusion Representation (CFR), which can reduce the sensitivity of VQA models to adversarial perturbations of both the vision and language inputs. We further enhance the adversarial training with augmented CFRs. Comprehensive experimental results show that our method mitigates adversarial attacks while preserving generalization ability on clean samples under various system settings, and that it outperforms other defense methods.