Dual-feature collaborative relation-attention networks for visual question answering
Lu Yao, You Yang, Juntao Hu
DOI: https://doi.org/10.1007/s13735-023-00283-8
2023-08-05
International Journal of Multimedia Information Retrieval
Abstract: Region and grid features extracted by object detection networks contain abundant image information and are widely used in visual question answering (VQA). Region features focus on object-level information, whereas grid features are better at representing contextual information and fine-grained attributes of images. However, most existing VQA models process visual information with one-way attention, which fails to capture the internal relations between objects or to analyze feature details. To address this issue, we propose a novel multi-level collaborative decoder (MLCD) layer, based on the encoder–decoder framework, that incorporates visual location vectors into attention. Specifically, each MLCD layer is equipped with three different attention-MLP sub-modules that progressively and accurately mine the intrinsic interactions of features and enhance the influence of image content on prediction results. Additionally, to fully exploit the respective advantages of the two feature types, we propose a novel relativity-augmented cross-attention (RACA) unit and add it to the MLCD layer; in this unit, the features produced by simple attention are complementarily augmented using global information and self-attributes. To validate the proposed methods, we stack MLCD layers deeply to form our dual-feature collaborative relation-attention network (DFCRAN). We conduct extensive experiments and visualize the results on three benchmark datasets: COCO-QA, VQA 1.0, and VQA 2.0. The results demonstrate the effectiveness of our model, which achieves competitive performance compared with state-of-the-art single models without pre-training.
computer science, artificial intelligence, software engineering