BENet: bi-directional enhanced network for image captioning

Peixin Yan, Zuoyong Li, Rong Hu, Xinrong Cao
DOI: https://doi.org/10.1007/s00530-023-01230-7
IF: 3.9
2024-01-30
Multimedia Systems
Abstract: Transformer-based models have been used in image captioning to generate natural language text that accurately describes a given image. In this paper, we propose a bi-directional enhanced network, which uses a memory bank to strengthen the correlation between image features and text features and thereby improve the performance of the transformer-based encoder–decoder framework for image captioning. In addition, we fine-tune the connection method in the encoder to obtain rich image features. Specifically, during training, the memory bank is first used to store the correspondences between images and annotated texts in the dataset as additional information for the image features. After processing through the encoder, we feed the visual features, composed of the image features and the additional information from the memory bank, into the decoder to generate a better caption. Subsequently, we use a decoder-like architecture to reconstruct the visual features from the generated caption. Finally, we calculate the similarity loss between the reconstructed features and the visual features to optimize the encoder. Extensive experiments on the MSCOCO benchmark show that the proposed method achieves promising results on both the Karpathy test split and the online test server, providing evidence of its effectiveness.
computer science, information systems, theory & methods
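
The last step described in the abstract, a similarity loss between the reconstructed features and the visual features, can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used, so cosine similarity is an assumption here, and the function names `cosine_similarity` and `similarity_loss` are hypothetical rather than taken from the paper. Features are plain Python lists of floats to keep the example self-contained.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_loss(visual, reconstructed):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    ASSUMPTION: the paper only says "similarity loss"; cosine distance is
    used here purely for illustration.
    """
    if not visual or len(visual) != len(reconstructed):
        raise ValueError("expected two non-empty feature lists of equal length")
    losses = [1.0 - cosine_similarity(v, r) for v, r in zip(visual, reconstructed)]
    return sum(losses) / len(losses)


# Toy example: two visual feature vectors and their reconstructions.
visual_features = [[1.0, 0.0], [0.0, 1.0]]
reconstructed_features = [[1.0, 0.0], [1.0, 0.0]]
loss = similarity_loss(visual_features, reconstructed_features)
```

In this toy case the first pair is perfectly aligned and the second pair is orthogonal, so the loss falls halfway between the best value (0) and the orthogonal value (1). In a training setup like the one the abstract outlines, a loss of this kind would be minimized so that the features reconstructed from the caption stay close to the visual features.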