Abstract: Existing unbiased VQA models reduce the spurious correlation between questions and answers to force the model to focus on visual information. However, the visual information these models capture is irrelevant to the correct answer, so they still fall back on spurious correlations and predict incorrect answers. Because they fail to obtain critical visual information, they perform poorly on questions dominated by visual information. To capture valuable visual information, this paper proposes a novel unbiased VQA model based on causal inference that leverages an Instrumental Variable (IVar) to increase the causal effect of visual features on answers. First, to obtain suitable instrumental variables, a noise generator is designed according to the constraints of IVar. The generated noise is treated as the IVar and is used to pollute the original visual features. Then, this paper proposes an IVar loss, which uses the generated IVar to increase the causal effect of visual features on answers. When the visual features are polluted by the IVar, the IVar loss guides the model to predict incorrect answers, which strengthens the correlation between the IVar and the answer. Since this correlation is proportional to the causal effect of the visual features on the answer, the IVar loss raises the importance of visual information and thereby steers the model toward capturing critical visual information. Extensive experiments on widely used benchmarks demonstrate the advantages of the proposed method, which achieves the best accuracy on the answer type Other of VQA-CP v2. Because most questions of answer type Other are dominated by visual information, this result demonstrates the method's strength in capturing critical visual information.
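The proportionality claim in the abstract (the correlation between the IVar and the answer scales with the causal effect of the visual features on the answer) has a simple textbook analogue in the linear instrumental-variable setting. The sketch below is only that analogue, not the paper's noise generator or IVar loss. It assumes a scalar "visual feature" x, a scalar "answer score" y, a hidden confounder u standing in for the language bias, and Gaussian noise z as the instrument. All names, coefficients, and the linear model itself are hypothetical choices made for illustration.

```python
import random


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def simulate(beta, n=20000, seed=0):
    """Draw samples from a toy linear model (an assumption, not the paper's model).

    z: instrument, noise that is independent of the confounder u
    x: "visual feature", polluted by z and influenced by u
    y: "answer score", caused by x with effect beta and also by u
    """
    rng = random.Random(seed)
    zs, xs, ys = [], [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)              # hidden confounder
        z = rng.gauss(0.0, 1.0)              # instrument (generated noise)
        x = z + u + rng.gauss(0.0, 0.5)      # feature polluted by the instrument
        y = beta * x + 2.0 * u + rng.gauss(0.0, 0.5)
        zs.append(z)
        xs.append(x)
        ys.append(y)
    return zs, xs, ys


def iv_estimate(zs, xs, ys):
    """Classic ratio estimator: Cov(z, y) / Cov(z, x)."""
    return covariance(zs, ys) / covariance(zs, xs)


def naive_estimate(xs, ys):
    """Regression slope of y on x, which absorbs the confounder's influence."""
    return covariance(xs, ys) / covariance(xs, xs)


if __name__ == "__main__":
    for beta in (0.5, 1.5):
        zs, xs, ys = simulate(beta)
        print("true effect:", beta)
        print("  Cov(z, y):", round(covariance(zs, ys), 3))
        print("  IV estimate:", round(iv_estimate(zs, xs, ys), 3))
        print("  naive slope:", round(naive_estimate(xs, ys), 3))
```

In this toy model Cov(z, x) is fixed at about 1, so Cov(z, y) tracks the causal effect beta directly, while the naive slope is inflated by the confounder. This is the sense in which strengthening the instrument-answer association corresponds to a larger causal effect of the feature on the answer. How the paper realizes this with learned noise and a training loss is not specified in the abstract.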

Unbiased Visual Question Answering by Leveraging Instrumental Variable
