Adversarial Sample Synthesis for Visual Question Answering

Chuanhao Li, Chenchen Jing, Zhen Li, Yuwei Wu, Yunde Jia
DOI: https://doi.org/10.1145/3688848
2024-01-01
Abstract: Language priors are a major obstacle to improving the generalization of visual question answering (VQA) models. Recent work has shown that synthesizing extra training samples to balance training sets is a promising way to alleviate language priors. However, most existing methods synthesize extra samples independently of the training process. This neglects the fact that the language priors memorized by VQA models change during training, so the synthesized samples are insufficient. In this paper, we propose an adversarial sample synthesis method, which synthesizes different adversarial samples by adversarial masking at different training epochs to cope with the changing memorized language priors. The basic idea behind our method is to use adversarial masking to synthesize adversarial samples that cause the model to give wrong answers. To this end, we design a generative module that carries out adversarial masking by attacking the VQA model, and introduce a bias-oriented objective to supervise the training of the generative module. We couple the sample synthesis with the training process of the VQA model, which ensures that the samples synthesized at different training epochs are beneficial to the VQA model. We incorporated the proposed method into three VQA models (UpDn, LMH and LXMERT) and conducted experiments on three datasets (VQA-CP v1, VQA-CP v2 and VQA v2). Experimental results demonstrate that our method yields large improvements, such as a 16.22% gain in overall accuracy for LXMERT on VQA-CP v2.
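The idea of adversarial masking described in the abstract can be illustrated with a toy sketch. This is not the authors' method: the abstract describes a learned generative module trained with a bias-oriented objective, whereas the sketch below swaps in a simple greedy search over which input token to mask. The model, feature vectors, weights, and all function names (`toy_vqa_model`, `adversarial_mask`) are hypothetical and chosen only to show how a mask can be selected to push a model away from the ground-truth answer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_vqa_model(token_features, weights):
    """Hypothetical stand-in for a VQA model.

    Sums the token feature vectors, applies one linear layer, and
    returns a probability for each candidate answer.
    """
    pooled = [sum(col) for col in zip(*token_features)]
    scores = [sum(w * x for w, x in zip(row, pooled)) for row in weights]
    return softmax(scores)


def adversarial_mask(token_features, weights, answer_idx, num_masked=1):
    """Greedy search (an assumption, not the paper's generative module).

    Repeatedly zeroes out the token whose removal most lowers the
    probability the toy model assigns to the ground-truth answer.
    Returns the masked token indices and the masked features.
    """
    masked = [list(t) for t in token_features]
    chosen = []
    for _ in range(num_masked):
        best_idx, best_prob = None, None
        for i in range(len(masked)):
            if i in chosen:
                continue
            trial = [t if j != i else [0.0] * len(t) for j, t in enumerate(masked)]
            prob = toy_vqa_model(trial, weights)[answer_idx]
            if best_prob is None or prob < best_prob:
                best_idx, best_prob = i, prob
        chosen.append(best_idx)
        masked[best_idx] = [0.0] * len(masked[best_idx])
    return chosen, masked


# Toy data: three tokens with 2-d features, two candidate answers.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [[2.0, 0.0], [0.0, 1.0]]
answer_idx = 0  # index of the ground-truth answer

prob_before = toy_vqa_model(tokens, weights)[answer_idx]
chosen, masked_tokens = adversarial_mask(tokens, weights, answer_idx)
prob_after = toy_vqa_model(masked_tokens, weights)[answer_idx]
```

In this toy setting the masked sample flips the model's prediction away from the ground-truth answer. In a training loop that follows the abstract's description, such samples would be regenerated as the model changes across epochs, which the sketch does not attempt to reproduce.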