Simple contrastive learning in a self-supervised manner for robust visual question answering
Shuwen Yang, Luwei Xiao, Xingjiao Wu, Junjie Xu, Linlin Wang, Liang He
DOI: https://doi.org/10.1016/j.cviu.2024.103976
IF: 4.886
2024-02-28
Computer Vision and Image Understanding
Abstract: Recent observations have revealed that Visual Question Answering (VQA) models are susceptible to learning the spurious correlations formed by dataset biases, i.e., the language priors, instead of the intended solution. For instance, given a question and a related image, some VQA systems tend to return the answer that occurs most frequently in the dataset while disregarding the image content. This tendency makes them brittle in real-world settings and harms their robustness. We found experimentally that conventional VQA methods often confuse negative samples that have identical questions but different images, which gives rise to linguistic bias. In this paper, we propose a simple contrastive learning scheme, namely SCLSM, to mitigate these issues in a self-supervised manner. We construct several special negative samples and introduce a debiasing-aware contrastive learning approach to help the model learn more discriminative multimodal features, thus improving its debiasing ability. SCLSM is compatible with numerous VQA baselines. Experimental results on the widely used public datasets VQA-CP v2 and VQA v2 validate the effectiveness of the proposed model.
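The abstract does not state the loss that SCLSM uses, so the following is only a minimal sketch of a generic InfoNCE-style contrastive loss, not the authors' formulation. It illustrates why negatives that share the question but differ in the image matter: when their fused features sit close to the anchor, the loss is large, which pushes the model to separate them. The function names, the temperature value, and the toy feature vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (assumed here, not taken from the paper):
    -log(exp(s_pos / t) / (exp(s_pos / t) + sum_k exp(s_neg_k / t)))."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy fused (question, image) features; the values are made up.
anchor = [1.0, 0.0]             # original question with its original image
positive = [0.9, 0.1]           # hypothetical positive view of the same pair
negative_far = [[0.0, 1.0]]     # same question, different image, well separated
negative_near = [[1.0, 0.05]]   # same question, different image, nearly identical

loss_far = info_nce(anchor, positive, negative_far)
loss_near = info_nce(anchor, positive, negative_near)
print(loss_far < loss_near)
```

In this toy setting the loss is near zero when the different-image negative is already far from the anchor, and much larger when the two are nearly indistinguishable, which is the confusion case the abstract describes.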
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic