Abstract: Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions based on visual cues. Unfortunately, language bias is a common problem in VQA: the model generates answers solely from surface-level correlations between question-answer pairs in the training set, without fully understanding the visual content. To reduce language bias, several recent approaches increase image dependency by introducing auxiliary tasks. However, these auxiliary tasks balance the data by adding extra manual image annotations or by simply constructing counterfactual samples, without fully exploring the intrinsic information of the samples themselves. In this paper, we tackle the language bias problem by proposing an adaptive self-supervised counterfactual learning (ASCL) method that enhances the model's understanding of images. We propose a new adaptive feature selection module to mine the intrinsic information of the samples. Based on the given question, this module adaptively divides the image into question-relevant visual positive objects and question-irrelevant visual negative objects. The question-relevant visual positive objects are used directly to generate the predicted answer, which reduces the influence of distracting visual information on the model's understanding of the image and ensures that the answer rests on its actual cause. The question-irrelevant visual negative objects are treated as counterfactual samples to guide model training and prevent the model from being driven by language bias. To avoid misclassifying images near the classification boundary during training, we propose an adaptive contrastive loss that automatically adjusts the measurement distance to increase the separation between such images. Our method has been extensively evaluated on the VQA-CP dataset, demonstrating its effectiveness and yielding improved results.
Specifically, building on the LMH model, we achieve state-of-the-art performance on both the VQA-CP v1 and VQA-CP v2 datasets. Notably, our method significantly improves the accuracy of the baseline, by 10.36% on VQA-CP v2 and 9.38% on VQA-CP v1. The source code is publicly available at https://github.com/shuxy0120/ASCL.
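
The two ideas described in the abstract, splitting detected objects by question relevance and widening the contrastive margin for borderline samples, can be illustrated with a minimal sketch. This is not the authors' implementation. The cosine-similarity scoring, the mean-score threshold, the exponential margin schedule, and all function names (`split_objects`, `adaptive_margin_loss`) are assumptions chosen for illustration; the abstract does not specify any of them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_objects(question_vec, object_vecs):
    """Split object indices into question-relevant and question-irrelevant sets.

    Assumption: relevance is cosine similarity to the question embedding, and
    the threshold is the mean relevance of this sample, so it adapts per
    question rather than being a fixed constant.
    """
    scores = [cosine(question_vec, obj) for obj in object_vecs]
    threshold = sum(scores) / len(scores)
    positive = [i for i, s in enumerate(scores) if s >= threshold]
    negative = [i for i, s in enumerate(scores) if s < threshold]
    return positive, negative


def adaptive_margin_loss(d_pos, d_neg, base_margin=0.5):
    """Hinge-style contrastive loss with a margin that grows for borderline samples.

    Assumption: the margin scales with exp(-|d_neg - d_pos|), so samples whose
    positive and negative distances are nearly equal get a larger margin and
    are pushed apart harder.
    """
    gap = abs(d_neg - d_pos)
    margin = base_margin * (1.0 + math.exp(-gap))
    return max(0.0, d_pos - d_neg + margin)


# Toy example: two objects aligned with the question, one orthogonal to it.
question = [1.0, 0.0]
objects = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
positive, negative = split_objects(question, objects)
print(positive, negative)

# A well-separated sample incurs no loss; a borderline one is penalized.
print(adaptive_margin_loss(0.2, 2.0))
print(adaptive_margin_loss(1.0, 1.0))
```

In this sketch the positive indices would feed the answer predictor and the negative indices would form the counterfactual input, mirroring the roles the abstract assigns to the two object sets.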

From Superficial to Deep: Language Bias Driven Curriculum Learning for Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Debiasing Medical Visual Question Answering via Counterfactual Training

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention

ASCL: Adaptive self-supervised counterfactual learning for robust visual question answering

Learning content and context with language bias for Visual Question Answering

SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering

An Empirical Study on the Language Modal in Visual Question Answering

Collaborative Modality Fusion for Mitigating Language Bias in Visual Question Answering

Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective

VQA-BC: Robust Visual Question Answering via Bidirectional Chaining

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

Simple contrastive learning in a self-supervised manner for robust visual question answering

Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA

Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning

Suppressing Biased Samples for Robust VQA

Overcoming language priors with self-contrastive learning for visual question answering

Plenty is Plague: Fine-Grained Learning for Visual Question Answering

Language bias in Visual Question Answering: A Survey and Taxonomy

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts