CausalME: Balancing bi-modalities in Visual Question Answering.

Chenji Lu, Ge Bai, Shilong Li, Ying Liu, Xiyan Liu, Zerong Zeng, Ruifang Liu
DOI: https://doi.org/10.1109/ICASSP48485.2024.10447342
2024-01-01
Abstract: Mitigating linguistic bias and attaining modal equilibrium are pivotal concerns in Visual Question Answering (VQA). Previous work has mainly focused on data augmentation or uni-modal approaches, which do not fully utilize bi-modal information. In this work, we propose CausalME, a new causal modal equilibrium framework that addresses the issue from a causal perspective. CausalME uses a question-only branch to capture the linguistic bias of the textual modality and mitigates its causal effect with a newly designed adaptive paradigm. Additionally, CausalME employs counterfactual generation to enhance the causal effect of the visual modality. By optimizing the objective function of the entire VQA model, CausalME balances the causal effects of both modalities and explicitly guides the model to align text and image information. We conducted extensive experiments, and the results show that CausalME brings significant improvements and achieves competitive performance on the bias-sensitive VQA-CP v2 dataset.
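As a rough illustration of the question-only branch idea, the sketch below subtracts a question-only prediction from the fused (question plus image) prediction in log space, so that answers favored by the question alone are penalized. This is a generic debiasing sketch, not the CausalME method: the abstract does not specify the adaptive paradigm, so a fixed, hypothetical weight `alpha` stands in for it, and the function names, answers, and logits are all made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def debiased_scores(fused_logits, question_only_logits, alpha=1.0):
    """Subtract the scaled question-only log-probability from the fused one.

    `alpha` is a hypothetical fixed weight; the paper describes an adaptive
    scheme whose details are not given in the abstract.
    """
    fused = softmax(fused_logits)
    q_only = softmax(question_only_logits)
    return [math.log(f) - alpha * math.log(q) for f, q in zip(fused, q_only)]


# Toy example: the question alone strongly suggests "yellow" (a language
# prior), while the fused model is only mildly in favor of it.
answers = ["yellow", "green"]
fused_logits = [2.0, 1.5]
question_only_logits = [2.0, 0.0]

biased = answers[fused_logits.index(max(fused_logits))]
scores = debiased_scores(fused_logits, question_only_logits)
prediction = answers[scores.index(max(scores))]
```

In this toy case the fused branch alone picks "yellow", while removing the question-only contribution shifts the choice to "green", the answer that the image evidence must be supporting.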