Context-aware Multi-level Question Embedding Fusion for visual question answering
Shengdong Li, Chen Gong, Yuqing Zhu, Chuanwen Luo, Yi Hong, Xueqiang Lv
DOI: https://doi.org/10.1016/j.inffus.2023.102000
IF: 18.6
2024-02-01
Information Fusion
Abstract: The question model has attracted wide attention as a cornerstone of Visual Question Answering (VQA) models. Existing question models attempt to exploit word context to extract multi-level concepts for modeling multi-level questions, but they still have significant defects. For example, most question models use simple fusion methods to fuse shallow modules and extract low-level concepts whose parameters are not shared, leading to poor modeling of multi-level questions. Although some question models use a deep bidirectional Transformer encoder in external knowledge transfer and BERT to model multi-level questions, their complexity remains high. To address these issues, we propose a novel low-complexity multi-level contextual question model, termed Context-aware Multi-level Question Embedding Fusion (CMQEF). We formalize its concepts and theories; derive its modeling, optimization, and feature extraction processes; analyze its low complexity and high expressiveness; and prove that it offers a new way to overcome parameter non-sharing when extracting parameter-shared multi-level concepts and to optimize the tradeoff between expressiveness and complexity in question models. Extensive experiments on VQAv2 and VQA-CPv2 show that CMQEF outperforms the state of the art on SANs and UpDn, reduces the language priors of SANs and UpDn, and offers better interpretability and applicability. Our code is available at https://github.com/lsdruc/CMQEF-for-VQA.
Keywords: computer science, artificial intelligence, theory & methods