Abstract: Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions based on visual cues. Unfortunately, language bias is a common problem in VQA: the model generates answers solely from surface-level correlations between question-answer pairs in the training set, without fully understanding the visual content. To reduce language bias, several recent approaches increase image dependency by introducing auxiliary tasks. However, these auxiliary tasks balance the data by adding extra manual image annotations or by simply constructing counterfactual samples, without fully exploring the intrinsic information of the samples themselves. In this paper, we tackle the language bias problem by proposing an adaptive self-supervised counterfactual learning (ASCL) method that enhances the model's understanding of images. We propose a new adaptive feature selection module to mine the intrinsic information of the samples. Based on the given question, this module adaptively divides the image into question-relevant visual positive objects and question-irrelevant visual negative objects. The positive objects are used directly to generate the predicted answer, which reduces the influence of distracting visual information on the model's understanding of the image and ensures that the answer is grounded in its actual cause. The negative objects are treated as counterfactual samples to guide model training and prevent the model from being driven by language bias. To avoid misclassifying images that lie near the classification boundary during training, we propose an adaptive contrastive loss that automatically adjusts the distance measure to increase the separation between such images. We evaluate our method extensively on the VQA-CP datasets. Specifically, with the LMH model as the base, we achieve state-of-the-art performance on both VQA-CP v1 and VQA-CP v2. Notably, our method improves the accuracy of the baseline by 10.36% on VQA-CP v2 and 9.38% on VQA-CP v1. The source code is publicly available at: https://github.com/shuxy0120/ASCL .
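The abstract does not spell out how the positive/negative split or the adaptive loss is computed, so the following is only a minimal illustrative sketch, not the authors' implementation (which is in the repository linked above). It assumes that question relevance is scored by cosine similarity, that the split threshold is the per-sample mean score, and that "adaptive" means a hinge margin that grows when the positive and negative scores are close. All function names and parameters (`split_objects`, `adaptive_margin_loss`, `base_margin`, `scale`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_objects(question_vec, object_vecs, threshold=None):
    """Split object indices into question-relevant (positive) and
    question-irrelevant (negative) sets.

    Assumption: relevance is the cosine similarity to the question vector,
    and the default threshold is the mean score of this sample.
    """
    scores = [cosine(question_vec, obj) for obj in object_vecs]
    if threshold is None:
        threshold = sum(scores) / len(scores)
    positive = [i for i, s in enumerate(scores) if s >= threshold]
    negative = [i for i, s in enumerate(scores) if s < threshold]
    return positive, negative, scores


def adaptive_margin_loss(pos_score, neg_score, base_margin=0.2, scale=0.5):
    """Hinge-style contrastive loss with a margin that widens for hard pairs.

    Assumption: scores lie in [0, 1]; the closer the positive and negative
    scores are, the larger the margin the pair is asked to satisfy.
    """
    closeness = max(0.0, 1.0 - abs(pos_score - neg_score))
    margin = base_margin + scale * closeness
    return max(0.0, margin - (pos_score - neg_score))


if __name__ == "__main__":
    question = [1.0, 0.0]
    objects = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    positive, negative, _ = split_objects(question, objects)
    print("positive objects:", positive)
    print("negative objects:", negative)
    print("easy pair loss:", adaptive_margin_loss(0.9, 0.1))
    print("hard pair loss:", adaptive_margin_loss(0.55, 0.45))
```

In this toy setup, a pair whose scores are already well separated contributes no loss, while a pair near the decision boundary is penalized with a larger margin, which is one plausible reading of "automatically adjusts the measurement distance".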

ASCL: Adaptive self-supervised counterfactual learning for robust visual question answering
