avtmNet: Adaptive Visual-Text Merging Network for Image Captioning

Heng Song, Junwu Zhu, Yi Jiang
DOI: https://doi.org/10.1016/j.compeleceng.2020.106630
IF: 4.152
2020-01-01
Computers & Electrical Engineering
Abstract: Automatically generating descriptions for images has recently been studied extensively. Among the many image captioning techniques proposed, the attention-based encoder-decoder framework has achieved great success. Two types of attention model have been proposed for generating captions: visual attention models, which are good at describing details, and text attention models, which are good at comprehensive understanding. To integrate and make full use of both visual and text information and thereby generate more accurate captions, we first introduce a visual attention model to produce the visual information and a text attention model to produce the text information, and then propose an adaptive visual-text merging network (avtmNet). This merging network effectively merges the visual and text information and automatically determines the proportion of each used to generate the next caption word. Extensive experiments on the COCO2014 and Flickr30K datasets show the effectiveness and superiority of the proposed approach.
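The idea of automatically determining the proportion of visual and text information can be illustrated with a generic gated blend. The abstract does not specify how avtmNet computes this proportion, so the following is only a minimal sketch under assumptions: a single scalar sigmoid gate, and the hypothetical names `adaptive_merge`, `w_visual`, `w_text`, and `bias`, none of which come from the paper.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_merge(visual, text, w_visual, w_text, bias=0.0):
    """Blend a visual and a text context vector with a scalar gate.

    Assumption (not from the paper): the gate is a sigmoid over a
    linear score of both inputs. `beta` is the share given to the
    visual vector; `1 - beta` is the share given to the text vector.
    """
    if not (len(visual) == len(text) == len(w_visual) == len(w_text)):
        raise ValueError("all vectors must have the same length")
    score = bias
    score += sum(w * v for w, v in zip(w_visual, visual))
    score += sum(w * t for w, t in zip(w_text, text))
    beta = sigmoid(score)
    merged = [beta * v + (1.0 - beta) * t for v, t in zip(visual, text)]
    return merged, beta


# Toy usage: with all-zero gate weights the gate is neutral (beta = 0.5),
# so the merged vector is the plain average of the two inputs.
merged, beta = adaptive_merge([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0])
```

In a trained captioning model the gate parameters would be learned and the gate recomputed at every decoding step, so the mix could lean toward visual information for some words and toward text information for others.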