Recurrent fusion transformer for image captioning

Zhenping Mou, Qiao Yuan, Tianqi Song
DOI: https://doi.org/10.1007/s11760-024-03675-3
IF: 1.583
2024-12-04
Signal Image and Video Processing
Abstract: Image captioning describes the visual content of a given image in natural language sentences. However, in existing image captioning models, the encoder describes the image content from only a specific pattern and cannot fully capture the semantic sequence information of the input image. In this paper, we propose a multimodal recurrent fusion block (RF-Block), which uses a new recurrent attention mechanism combined with gated recurrent networks to capture feature correlation information. In addition, stacking multiple feature fusion blocks further strengthens the relationships between higher-level features. Finally, the feature fusion block is inserted into the Transformer to form a recurrent fusion transformer (RCT), which improves the performance of the image captioning model. Experimental results show that the proposed model outperforms the traditional encoder-decoder image captioning model and performs comparably to state-of-the-art models.
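The abstract gives no equations for the RF-Block, so the following is only an illustrative sketch of the general idea it names: attention output fused with the previous feature state through a GRU-style gate, with several such blocks stacked. Every name here (`attention`, `gated_fusion`, `fusion_stack`), the random weights, and the omission of learned query/key/value projections are assumptions made for illustration, not the authors' design.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(feats):
    """Scaled dot-product self-attention without learned projections (assumption)."""
    d = len(feats[0])
    out = []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in feats]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, feats)) for j in range(d)])
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def gated_fusion(h, a, Wz, Wc):
    """GRU-style update: blend the previous state h with a candidate built from h and a."""
    ha = h + a  # list concatenation of state and attended vector
    z = [sigmoid(x) for x in matvec(Wz, ha)]    # update gate in (0, 1)
    c = [math.tanh(x) for x in matvec(Wc, ha)]  # candidate state in (-1, 1)
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

def fusion_stack(feats, num_blocks=3, seed=0):
    """Apply several attention + gated-fusion blocks in sequence (hypothetical stack)."""
    rng = random.Random(seed)  # random weights stand in for trained parameters
    d = len(feats[0])
    state = [list(f) for f in feats]
    for _ in range(num_blocks):
        Wz = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(d)]
        Wc = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(d)]
        attended = attention(state)
        state = [gated_fusion(h, a, Wz, Wc) for h, a in zip(state, attended)]
    return state

# Three toy "region features" of dimension 4.
regions = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [0.5, 0.5, 0.0, 0.0]]
fused = fusion_stack(regions)
```

Because the gate lies in (0, 1), each block outputs a convex combination of the previous state and the candidate, so the fused features keep the input shape and stay bounded. In a trained model the weights would be learned, and the fused features would feed the Transformer encoder or decoder.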
engineering, electrical & electronic; imaging science & photographic technology