Local-global Visual Interaction Attention for Image Captioning

Changzhi Wang, Xiaodong Gu
DOI: https://doi.org/10.1016/j.dsp.2022.103707
IF: 2.92
2022-01-01
Digital Signal Processing
Abstract: Image captioning is a typical cross-modal task that aims to automatically describe the main content of an image in a complete and natural sentence. Existing attention-based approaches treat the local and global features of an image separately, neglecting the intrinsic interaction between them, which provides important guidance for caption generation. To alleviate this issue, we propose a novel Local-Global Visual Interaction Attention (LGVIA) structure that explores the intrinsic interactions between the local and global features of an image. Specifically, we devise a new visual interaction graph network that consists mainly of a visual interaction encoding module and a visual interaction fusion module. The former implicitly encodes the visual relationships between local and global features to obtain an enhanced visual representation containing rich local-global feature relationships. The latter fuses the resulting multiple relationship features to further enrich relationship attribute information at different levels. In addition, we introduce a new relationship-attention-based LSTM module that guides word generation by dynamically focusing on the fused relationship information. Extensive qualitative and quantitative experimental results demonstrate the superiority of our LGVIA approach on the large-scale MSCOCO dataset. More remarkably, LGVIA outperforms the related state-of-the-art methods on the small-scale Flickr30k dataset.
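The abstract does not give the equations behind the interaction encoding, so the following is only a minimal illustrative sketch of what one local-global interaction step could look like, not the authors' LGVIA implementation. It assumes scaled dot-product attention of a single global feature over a set of local (region) features, followed by a residual sum; the function name `local_global_interaction` and the toy inputs are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def local_global_interaction(local_feats, global_feat):
    """Hypothetical interaction step (an assumption, not taken from the paper):
    score each local feature by its scaled dot-product similarity to the
    global feature, pool the local features with the resulting weights,
    and add the pooled context back onto the global feature."""
    d = len(global_feat)
    scores = [dot(f, global_feat) / math.sqrt(d) for f in local_feats]
    weights = softmax(scores)
    context = [sum(w * f[i] for w, f in zip(weights, local_feats))
               for i in range(d)]
    enhanced = [g + c for g, c in zip(global_feat, context)]
    return weights, enhanced

# Toy input: three region features and one global feature, all 2-dimensional.
local_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
global_feat = [1.0, 1.0]
weights, enhanced = local_global_interaction(local_feats, global_feat)
```

In this toy case the third region is most similar to the global feature and therefore receives the largest attention weight. A captioning model would use learned projections and feed the enhanced representation to a decoder such as the LSTM mentioned in the abstract, neither of which is modeled here.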