Delving Into Precise Attention In Image Captioning

Shaohan Hu, Shenglei Huang, Guolong Wang, Zhipeng Li, Zheng Qin
DOI: https://doi.org/10.1007/978-3-030-36802-9_9
2019-01-01
Abstract: Recent image captioning models usually use the output of the last convolutional layer of a pretrained CNN encoder directly. This intuitive design has two weaknesses: the top-layer feature is not position-sensitive, which hinders the decoder from generating precise spatial attention for objects of interest; and irrelevant features mislead the decoder into focusing on irrelevant regions. To tackle these weaknesses, we propose the Feature Selection and Fusion Network (FSFN). Specifically, to tackle the first weakness, a Feature Fusion module is proposed to generate fine-grained and position-sensitive features by fusing multi-scale features. To handle the second weakness, a Feature Selection module is proposed to select more informative features, which prevents the decoder from focusing on irrelevant regions. Extensive experiments demonstrate that our model addresses both weaknesses and achieves results comparable with the state of the art on the MSCOCO dataset under cross-entropy loss, without any bells and whistles. Furthermore, our model improves performance under different encoders and decoders.
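The abstract does not say how fusion or selection is computed, so the following is only an illustrative sketch of the two ideas, not the authors' FSFN. It assumes that fusion means upsampling a coarse feature map to the resolution of a finer one and concatenating channels, and that selection means scaling each channel by a sigmoid gate computed from its global average (a squeeze-and-excitation-style stand-in). All function names, shapes, and gate parameters are hypothetical.

```python
import math

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an H x W grid of channel vectors."""
    out = []
    for row in fmap:
        wide = [cell for cell in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def fuse(fine, coarse):
    """Assumed fusion: concatenate fine channels with upsampled coarse channels."""
    factor = len(fine) // len(coarse)
    up = upsample_nearest(coarse, factor)
    return [[f + c for f, c in zip(frow, crow)] for frow, crow in zip(fine, up)]

def select(fused, weights, bias):
    """Assumed selection: per-channel sigmoid gate from the global channel mean."""
    h, w = len(fused), len(fused[0])
    n_ch = len(fused[0][0])
    means = [
        sum(fused[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(n_ch)
    ]
    gates = [
        1.0 / (1.0 + math.exp(-(weights[k] * means[k] + bias[k])))
        for k in range(n_ch)
    ]
    gated = [[[v * g for v, g in zip(cell, gates)] for cell in row] for row in fused]
    return gated, gates

# Toy inputs: a 4x4 fine map with 1 channel and a 2x2 coarse map with 2 channels.
fine = [[[float(4 * i + j)] for j in range(4)] for i in range(4)]
coarse = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

fused = fuse(fine, coarse)  # 4x4 grid, 3 channels per position
gated, gates = select(fused, weights=[0.1, 0.5, -0.5], bias=[0.0, 0.0, 0.0])
```

In this toy setup the fused map keeps the spatial resolution of the finer layer while carrying the coarse layer's channels, and a gate close to zero suppresses a channel before it reaches the attention decoder. In a trained model the gate parameters would be learned rather than fixed by hand.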