Modality Re-Balance for Visual Question Answering: A Causal Framework.

Xinpeng Lv, Wanrong Huang, Haotian Wang, Ruochun Jin, Xueqiong Li, Zhipeng Lin, Shuman Li, Yongquan Feng, Yuhua Tang
DOI: https://doi.org/10.1109/ICASSP48485.2024.10447690
2024-01-01
Abstract: Visual Question Answering (VQA) models often prioritize language cues over visual knowledge, leading to the "language prior" phenomenon. To address this, researchers have proposed methods to balance language and image information during training and inference. However, these approaches often struggle to capture important linguistic components due to the excessive exclusion of language information. Inspired by causal inference, we introduce a novel approach called the SyMmetrically Balanced Causal framework (SMBC) that rebalances visual and textual information in VQA tasks. This framework allows for an equal contribution of knowledge from both modalities to inference results. Experimental evaluation shows that SMBC: 1) applies to prevalent VQA models, including those with data augmentation, and 2) consistently improves performance on established benchmarks.