Weakly-Supervised Grounding for VQA with Dual Visual-Linguistic Interaction.
Yi Liu, Junwen Pan, Qilong Wang, Guanlin Chen, Weiguo Nie, Yudong Zhang, Qian Gao, Qinghua Hu, Pengfei Zhu
DOI: https://doi.org/10.1007/978-981-99-8850-1_13
2024-01-01
Abstract: Visual question answering (VQA) grounding, which aims to locate the visual evidence associated with an answer while answering the question, has attracted increasing research interest. To locate the evidence, most existing methods extract attention maps in an unsupervised manner from pretrained VQA models. Because only the text-related objective is considered during training, the attention map depicts the grounding region only coarsely, resulting in poor interpretability. A straightforward way to improve grounding accuracy is to use pixel-wise masks as strong supervision. However, precise per-pixel annotation is time-consuming and labor-intensive. To address these issues, this paper presents weakly-supervised grounding for VQA, which learns an end-to-end Dual Visual-Linguistic Interaction (DaVi) network in a unified architecture from various low-cost annotations, such as click-, scribble-, and box-level grounding labels. Specifically, to enable visual mask prediction, DaVi introduces a language-based visual decoder that extends the previous VQA network. Since the visual decoder is guided by weak labels, we also present a Pseudo Grounding Refinement Module (PGRM) that refines the relatively coarse predictions and serves as an additional constraint. Extensive experiments demonstrate that our weakly supervised DaVi significantly improves grounding performance even under click-level supervision with a single annotated pixel. Scribble-level supervision achieves 92% of the performance of its fully supervised counterpart at a dramatically reduced annotation cost. More importantly, weak visual grounding usually boosts the accuracy of text answers despite the inaccurate supervision.
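To make the three weak annotation types concrete, the following is a minimal sketch of how click-, scribble-, and box-level labels could be turned into sparse supervision for a predicted mask. It is not the DaVi training objective, which the abstract does not specify. The `IGNORE` marker, the helper names, the choice to treat every pixel inside a box as coarse foreground, and the partial binary cross-entropy loss are all assumptions made for illustration.

```python
import numpy as np

IGNORE = -1  # hypothetical marker for pixels that carry no annotation


def click_label(shape, point):
    """Click-level label: one annotated foreground pixel, the rest unlabeled."""
    label = np.full(shape, IGNORE, dtype=np.int64)
    row, col = point
    label[row, col] = 1
    return label


def scribble_label(shape, points):
    """Scribble-level label: a handful of foreground pixels along a stroke."""
    label = np.full(shape, IGNORE, dtype=np.int64)
    for row, col in points:
        label[row, col] = 1
    return label


def box_label(shape, box):
    """Box-level label (assumed convention): inside is coarse foreground,
    outside is background. box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    label = np.zeros(shape, dtype=np.int64)
    label[top:bottom, left:right] = 1
    return label


def partial_bce(prob, label, eps=1e-7):
    """Binary cross-entropy averaged over annotated pixels only."""
    known = label != IGNORE
    if not known.any():
        return 0.0
    p = np.clip(prob[known], eps, 1.0 - eps)
    y = label[known].astype(np.float64)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())


if __name__ == "__main__":
    prob = np.full((4, 4), 0.5)  # stand-in for a predicted mask
    print(partial_bce(prob, click_label((4, 4), (1, 2))))
    print(partial_bce(prob, scribble_label((4, 4), [(0, 0), (0, 1), (0, 2)])))
    print(partial_bce(prob, box_label((4, 4), (1, 1, 3, 3))))
```

With a click label the loss is driven by a single pixel, which is why a refinement step over the coarse predictions, such as the PGRM described in the abstract, is a natural complement to this kind of sparse signal.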