ASCL: Adaptive self-supervised counterfactual learning for robust visual question answering
Xinyao Shu,Shiyang Yan,Xu Yang,Ziheng Wu,Zhongfeng Chen,Zhenyu Lu
DOI: https://doi.org/10.1016/j.eswa.2023.123125
IF: 8.5
2024-02-08
Expert Systems with Applications
Abstract: Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions according to visual cues. Unfortunately, language bias is a common problem in VQA: the model generates answers solely from surface-level correlations between the question-answer pairs in the training set, without fully understanding the visual content. To reduce language bias, several recent approaches increase image dependency by introducing auxiliary tasks. However, these auxiliary tasks balance the data by adding extra manual image annotations or by simply constructing counterfactual samples, without fully exploring the intrinsic information of the samples themselves. In this paper, we tackle the language bias problem by proposing an adaptive self-supervised counterfactual learning (ASCL) method to enhance the model's understanding of images. We propose a new adaptive feature selection module to mine the intrinsic information of the samples. Based on the given question, this module adaptively divides the image into question-relevant visual positive objects and question-irrelevant visual negative objects. The question-relevant visual positive objects are used directly to generate the predicted answer, which reduces the influence of distracting visual information on the model's understanding of the image and ties the answer to its actual cause. The question-irrelevant visual negative objects are treated as counterfactual samples to guide model training and prevent the model from being driven by language bias. To avoid misclassifying images that lie near the classification boundary during training, we propose an adaptive contrastive loss that automatically adjusts the measurement distance to increase the separation between such images. Our method has been extensively evaluated on the VQA-CP dataset, demonstrating its effectiveness and yielding improved results.
Specifically, using the LMH model as the foundation, we achieve state-of-the-art performance on both the VQA-CP v1 and VQA-CP v2 datasets. Notably, our method significantly improves the accuracy of the baseline, by 10.36% on the VQA-CP v2 dataset and 9.38% on the VQA-CP v1 dataset. The source code is publicly available at https://github.com/shuxy0120/ASCL.
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science
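
The adaptive feature selection step described in the abstract, which divides detected objects into question-relevant positives and question-irrelevant negatives, might be sketched roughly as follows. This is only an illustration and not the authors' implementation: the cosine-similarity relevance score, the fixed top-k cutoff, and the names `cosine` and `split_objects` are all assumptions introduced here. The abstract does not specify how relevance is measured or how the split is made adaptive.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed relevance score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def split_objects(obj_feats, q_feat, k):
    """Split object indices into positives (top-k by relevance) and negatives (the rest).

    Hypothetical rule: the k objects most similar to the question vector are
    treated as question-relevant; all others serve as counterfactual negatives.
    """
    scores = [cosine(obj, q_feat) for obj in obj_feats]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pos_idx = sorted(order[:k])
    neg_idx = sorted(order[k:])
    return pos_idx, neg_idx, scores


# Toy example: four object feature vectors and one question vector.
obj_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
q_feat = [1.0, 0.0]
pos_idx, neg_idx, scores = split_objects(obj_feats, q_feat, k=2)
print(pos_idx, neg_idx)
```

In this sketch the positive objects would feed the answer predictor, while the negative objects would form the counterfactual input used during training; a learned, question-conditioned threshold could replace the fixed `k`.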