Incorporating retrieval-based method for feature enhanced image captioning

Shanshan Zhao, Lixiang Li, Haipeng Peng
DOI: https://doi.org/10.1007/s10489-022-04010-4
IF: 5.3
2022-08-13
Applied Intelligence
Abstract: Image captioning is a cross-modal task that describes an image in natural language. Commonly used image captioning methods are either generation-based or retrieval-based. In this paper, we propose a feature enhanced image captioning model made up of three main parts: a cross-modal feature enhanced module (CFD), gated feature fusion (GFF), and a cross-modal decoder. The retrieval-based method first retrieves semantically related similar sentences for each image. CFD then coarsely aligns the region-based visual features and the word-based features of the similar sentences with each other. GFF performs a deeper interaction between the coarsely aligned visual and semantic features, using a dynamic gate to control the fusion level, and produces finely aligned features. We concatenate the two sets of finely aligned features to form the enhanced features. Both the visual relationship features and the enhanced features guide the cross-modal decoder in generating the description. Our model achieves CIDEr scores of 131.0 and 68.3 on MSCOCO and Flickr30k, respectively, in comparison with other methods. Further ablation studies also demonstrate the effectiveness of each component.
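The abstract does not give the equations behind GFF, so the following is only a minimal sketch of one common form of gated fusion, not the authors' implementation. It assumes that a sigmoid gate is computed from the concatenated visual and semantic vectors and that the gate mixes the two element-wise. The names `gated_fusion`, `weights`, and `bias` are hypothetical, and the parameters would be learned in a real model.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, semantic, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Assumed form (not taken from the paper):
        gate  = sigmoid(W [visual; semantic] + bias)
        fused = gate * visual + (1 - gate) * semantic

    weights has one row per output dimension; each row has
    len(visual) + len(semantic) entries.
    """
    joint = visual + semantic  # list concatenation [visual; semantic]
    fused = []
    for i, (v, s) in enumerate(zip(visual, semantic)):
        pre_activation = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        gate = sigmoid(pre_activation)
        # gate near 1 keeps the visual feature, gate near 0 keeps the semantic one
        fused.append(gate * v + (1.0 - gate) * s)
    return fused


def enhanced_features(fused_visual, fused_semantic):
    """Concatenate the two fused feature sets into one enhanced vector."""
    return fused_visual + fused_semantic
```

With all-zero weights and bias the gate is 0.5 everywhere, so the fused vector is the average of the two inputs. A large positive bias pushes the gate toward 1 and the output toward the visual features, which illustrates how a gate can control the fusion level.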
computer science, artificial intelligence