Abstract: Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions based on visual cues. Unfortunately, language bias is a common problem in VQA: the model generates answers solely from surface-level correlations between question-answer pairs in the training set, without fully understanding the visual content. To reduce language bias, several recent approaches increase image dependency by introducing auxiliary tasks. However, these auxiliary tasks balance the data by adding extra manual image annotations or by simply constructing counterfactual samples, without fully exploring the intrinsic information of the samples themselves. In this paper, we tackle the language bias problem by proposing an adaptive self-supervised counterfactual learning (ASCL) method that enhances the model's understanding of images. We propose a new adaptive feature selection module to mine the intrinsic information of the samples. Based on the given question, this module adaptively divides the image into question-relevant visual positive objects and question-irrelevant visual negative objects. The question-relevant visual positive objects are used directly to generate the predicted answer, which reduces the influence of distracting visual information on the model's understanding of the image and ensures that the answer rests on its actual cause. The question-irrelevant visual negative objects are treated as counterfactual samples to guide model training and prevent the model from being driven by language bias. To avoid misclassifying images near the classification boundary during training, we propose an adaptive contrastive loss that automatically adjusts the distance measurement to increase the distance between images near that boundary. We evaluate our method extensively on the VQA-CP dataset, where it proves effective. Specifically, building on the LMH model, we achieve state-of-the-art performance on both VQA-CP v1 and VQA-CP v2. Notably, our method improves the accuracy of the baseline by 10.36% on VQA-CP v2 and by 9.38% on VQA-CP v1. The source code is publicly available at https://github.com/shuxy0120/ASCL.
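
The abstract does not give the exact form of the feature selection module or the adaptive loss, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the question and the detected objects are already embedded in a shared vector space, uses cosine similarity with a mean-score threshold to split objects into positive and negative sets, and uses a triplet-style loss whose margin widens when the two distances are close. The function names `split_objects` and `adaptive_margin_loss`, the threshold rule, and the margin rule are all hypothetical.

```python
import numpy as np


def split_objects(question_vec, object_feats):
    """Split objects into question-relevant (True) and question-irrelevant (False).

    Assumption: relevance is the cosine similarity between the question
    embedding and each object feature, thresholded at the mean score.
    """
    q = question_vec / (np.linalg.norm(question_vec) + 1e-8)
    v = object_feats / (np.linalg.norm(object_feats, axis=1, keepdims=True) + 1e-8)
    scores = v @ q                       # one relevance score per object
    positive_mask = scores >= scores.mean()
    return positive_mask, scores


def adaptive_margin_loss(anchor, positive, negative, base_margin=0.2):
    """Triplet-style loss with a margin that depends on the distance gap.

    Assumption: the margin grows (up to 2 * base_margin) as the positive and
    negative distances get closer, so hard-to-separate samples are pushed
    further apart. This rule is illustrative only.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    margin = base_margin * (1.0 + np.exp(-abs(d_neg - d_pos)))
    return max(0.0, d_pos - d_neg + margin)


# Toy usage with random features (8-dimensional, 5 objects).
rng = np.random.default_rng(0)
question = rng.normal(size=8)
objects = rng.normal(size=(5, 8))

mask, scores = split_objects(question, objects)
pos_feat = objects[mask].mean(axis=0)    # pooled question-relevant objects
neg_feat = objects[~mask].mean(axis=0)   # pooled counterfactual (irrelevant) objects
loss = adaptive_margin_loss(question, pos_feat, neg_feat)
```

In this sketch the pooled negative objects play the role of the counterfactual sample: the loss is zero only when the question embedding is closer to the relevant objects than to the irrelevant ones by at least the current margin.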

Simple contrastive learning in a self-supervised manner for robust visual question answering

ASCL: Adaptive self-supervised counterfactual learning for robust visual question answering

Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning

SC-ML: Self-supervised Counterfactual Metric Learning for Debiased Visual Question Answering

Debiasing Medical Visual Question Answering via Counterfactual Training

DCS: Debiased Contrastive Learning with Weak Supervision for Time Series Classification

Simple and Effective Visual Question Answering in a Single Modality

HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering

Overcoming language priors with self-contrastive learning for visual question answering

Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering

A Robust Visual Question Answering Approach to Reduce Multimodal Bias

From Superficial to Deep: Language Bias Driven Curriculum Learning for Visual Question Answering

Contrastive Visual-Question-Caption Counterfactuals on Biased Samples for Visual Question Answering

Suppressing Biased Samples for Robust VQA

Be Flexible! Learn to Debias by Sampling and Prompting for Robust Visual Question Answering

Learning content and context with language bias for Visual Question Answering

Debiased Visual Question Answering via the perspective of question types

Robust visual question answering via polarity enhancement and contrast

Robust Visual Question Answering with Contrastive-Adversarial Consistency Constraints

A Multi-modal Debiasing Model with Dynamical Constraint for Robust Visual Question Answering

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs