Bidirectional interactive alignment network for image captioning

Xinrong Cao, Peixin Yan, Rong Hu, Zuoyong Li
DOI: https://doi.org/10.1007/s00530-024-01559-7
IF: 3.9
2024-11-23
Multimedia Systems
Abstract: In recent years, many researchers have improved image captioning performance by fusing region features and grid features. However, the semantic gap between the two feature types is often overlooked during fusion, and multimodal feature interaction remains underexplored. In this paper, we propose a Bidirectional Interactive Alignment Network (BIANet) to strengthen multi-feature and multimodal fusion in both the encoder and the decoder. The bidirectional interactive encoder uses cross-interaction so that the two kinds of image features complement each other's strengths, enriching their visual information. In the decoder, we propose a cross-alignment module. This module lets the text features interact with the image features in two orders, "region feature-grid feature" and "grid feature-region feature", yielding two new text features. By increasing the similarity between these two text features, the semantic gap between region features and grid features is indirectly alleviated. Extensive experiments on the MS COCO dataset demonstrate that the proposed model achieves competitive results on the Karpathy test split.
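The two-order interaction described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not specify the attention form, the learned projections, or the similarity objective, so this sketch uses plain scaled dot-product attention without projections or residual connections, and a cosine-based alignment loss. The names `cross_attend`, `cross_alignment`, and the toy inputs are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention where `context` acts as both keys and values.

    Assumption: no learned projections, single head (kept minimal on purpose).
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)


def cross_alignment(text, region, grid):
    """Return the two text features and a hypothetical alignment loss."""
    # Order 1: text attends to region features, then to grid features.
    t_rg = cross_attend(cross_attend(text, region), grid)
    # Order 2: text attends to grid features, then to region features.
    t_gr = cross_attend(cross_attend(text, grid), region)
    # Assumed objective: 1 - mean cosine similarity between the two text features.
    sims = [cosine(a, b) for a, b in zip(t_rg, t_gr)]
    loss = 1.0 - sum(sims) / len(sims)
    return t_rg, t_gr, loss


# Toy inputs: 2 text tokens, 3 region features, 2 grid features, dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
region = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grid = [[0.5, 0.5], [1.0, 0.0]]
t_rg, t_gr, loss = cross_alignment(text, region, grid)
```

Minimizing a loss of this kind pulls the two text features together, which is one plausible way to read the abstract's statement that the semantic gap between region and grid features is alleviated indirectly.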
computer science, information systems, theory & methods