Abstract: Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions based on visual cues. Unfortunately, language bias is a common problem in VQA: the model generates answers solely from surface-level correlations between question-answer pairs in the training set, without fully understanding the visual content. To reduce language bias, several recent approaches increase image dependency by introducing auxiliary tasks. However, these auxiliary tasks balance the data by adding extra manual image annotations or by simply constructing counterfactual samples, without fully exploring the intrinsic information of the samples themselves. In this paper, we tackle the language bias problem by proposing an adaptive self-supervised counterfactual learning (ASCL) method that enhances the model's understanding of images. We propose a new adaptive feature selection module to mine the intrinsic information of the samples. Given a question, this module adaptively divides the image into question-relevant visual positive objects and question-irrelevant visual negative objects. The question-relevant visual positive objects are used directly to generate the predicted answer, which reduces the influence of distracting visual information on the model's understanding of the image and ensures that the prediction rests on the actual cause of the answer. The question-irrelevant visual negative objects are treated as counterfactual samples to guide model training and prevent the model from being driven by language bias. To avoid misclassifying images that lie near the classification boundary during training, we propose an adaptive contrastive loss that automatically adjusts the distance measure to increase the separation between such images. Our method has been extensively evaluated on the VQA-CP datasets, demonstrating its effectiveness. Specifically, with the LMH model as the foundation, we achieve state-of-the-art performance on both the VQA-CP v1 and VQA-CP v2 datasets. Notably, our method significantly improves the accuracy of the baseline, by 10.36% on VQA-CP v2 and 9.38% on VQA-CP v1. The source code is publicly available at: https://github.com/shuxy0120/ASCL
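
The two ideas in the abstract (splitting image objects by question relevance, and a contrastive loss whose margin adapts to borderline samples) can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `split_objects` and `adaptive_margin_loss`, the use of cosine similarity with a mean-relevance cut-off, and the margin formula are all assumptions made for illustration, whereas the paper's module is learned end to end.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_objects(question_vec, object_vecs):
    """Split object indices into (positive, negative) by question relevance.

    Assumption: relevance is cosine similarity to the question vector, and
    the cut-off is the mean relevance over all objects in the image.
    """
    scores = [cosine(question_vec, obj) for obj in object_vecs]
    threshold = sum(scores) / len(scores)
    positive = [i for i, s in enumerate(scores) if s > threshold]
    negative = [i for i, s in enumerate(scores) if s <= threshold]
    return positive, negative


def adaptive_margin_loss(d_pos, d_neg, base_margin=0.2, scale=0.5):
    """Hinge-style contrastive loss whose margin grows for borderline pairs.

    Assumption: a pair is "borderline" when d_pos and d_neg are nearly
    equal; the margin formula is illustrative, not taken from the paper.
    """
    margin = base_margin + scale * math.exp(-abs(d_neg - d_pos))
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D features: the question points along the first axis.
question = [1.0, 0.0]
objects = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-0.2, 0.9]]

positive, negative = split_objects(question, objects)
print("positive objects:", positive)
print("negative (counterfactual) objects:", negative)

# A borderline pair is penalized; a well-separated pair is not.
borderline = adaptive_margin_loss(d_pos=0.5, d_neg=0.6)
easy = adaptive_margin_loss(d_pos=0.1, d_neg=2.0)
print("borderline loss:", borderline, "easy loss:", easy)
```

In this toy setup the objects aligned with the question vector land in the positive set and the rest serve as the counterfactual (negative) set, while the loss assigns a larger margin to pairs whose positive and negative distances are close.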

Alleviating Shortcut Learning Behavior of VQA Model with Context Augmentation and Adaptive Loss Adjustment

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Adversarial Sample Synthesis for Visual Question Answering

Learning content and context with language bias for Visual Question Answering

Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA

Reducing Vision-Answer Biases for Multiple-Choice VQA

An Empirical Study on the Language Modal in Visual Question Answering

VQA-BC: Robust Visual Question Answering via Bidirectional Chaining

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning

Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

Visual Question Answering Method Based on Counterfactual Thinking

Suppressing Biased Samples for Robust VQA

Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention

Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions

HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering

From Superficial to Deep: Language Bias Driven Curriculum Learning for Visual Question Answering

ASCL: Adaptive self-supervised counterfactual learning for robust visual question answering

Collaborative Modality Fusion for Mitigating Language Bias in Visual Question Answering

Removing Bias of Video Question Answering by Causal Theory

Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering