VMAN: visual-modified attention network for multimodal paradigms

Xiaoyu Song, Dezhi Han, Chongqing Chen, Xiang Shen, Huafeng Wu
DOI: https://doi.org/10.1007/s00371-024-03563-4
IF: 2.835
2024-07-20
The Visual Computer
Abstract: Due to its excellent dependency modeling and powerful parallel computing capabilities, the Transformer has become the primary research method in vision-language tasks (VLT). However, multimodal VLT such as VQA and VG demand strong dependency modeling and comprehension of heterogeneous modalities, and the image self-interaction of conventional Transformers introduces noise, provides insufficient information interaction, and struggles to obtain refined visual features. This paper therefore proposes a universal visual-modified attention network (VMAN) to address these problems. Specifically, VMAN optimizes the attention mechanism in the Transformer by introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of image information. The modified unit modulates the image features to obtain more refined query features for the subsequent interaction, filtering out noise while enhancing dependency modeling and reasoning capabilities. Two modified approaches are designed: a weighted sum-based approach and a cross-attention-based approach. Finally, we conduct extensive experiments on VMAN across five benchmark datasets for two tasks (VQA, VG). The results indicate that VMAN achieves an accuracy of 70.99 on VQA-v2 and reaches 74.41 on RefCOCOg, which involves more complex expressions. These results support the rationality and effectiveness of VMAN. The code is available at https://github.com/79song/VMAN.
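The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation: image features are first modified by text features, and the result is used as the query for image self-attention. Everything here is an assumption made for illustration, including the function names (`weighted_sum_modify`, `cross_attention_modify`, `visual_modified_self_attention`), the residual addition, the absence of learned projections, and the way the weighted sum pools text tokens. The official code at the repository above may differ in all of these respects.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

def weighted_sum_modify(image, text, scores=None):
    """Hypothetical weighted-sum variant: pool the text tokens into one
    vector with softmax weights and add it to every image region."""
    if scores is None:
        scores = [0.0] * len(text)  # assumption: uniform weights by default
    weights = softmax_rows([scores])[0]
    pooled = [sum(w * tok[d] for w, tok in zip(weights, text))
              for d in range(len(text[0]))]
    return [[x + p for x, p in zip(region, pooled)] for region in image]

def cross_attention_modify(image, text):
    """Hypothetical cross-attention variant: each image region attends to
    the text tokens, and the result is added back to the region."""
    attended = attention(image, text, text)
    return [[x + a for x, a in zip(region, att)]
            for region, att in zip(image, attended)]

def visual_modified_self_attention(image, text, modify):
    """Use text-modified image features as queries; keys and values remain
    the original image features."""
    query = modify(image, text)
    return attention(query, image, image)

image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 image regions, dimension 2
text = [[0.5, 0.5], [1.0, 0.0]]               # 2 text tokens, dimension 2
refined = visual_modified_self_attention(image, text, cross_attention_modify)
```

In this sketch only the query side changes: the text decides what each region looks for, while the attended content still comes from the image itself. A real model would add learned projection matrices, multiple heads, and normalization layers.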
computer science, software engineering
What problem does this paper attempt to address?