An Image Captioning Approach Using Dynamical Attention.

Changzhi Wang, Xiaodong Gu
DOI: https://doi.org/10.1109/ijcnn52387.2021.9533994
2021-01-01
Abstract: In recent years, image captioning has made great progress as an active topic in the field of vision and language. Previous approaches have demonstrated the superiority of spatial and channel attention in the image captioning task. However, such attention-based approaches ignore the difference between function words (e.g., “to”, “for” and “out”) and notional words (e.g., “girl”, “teddy” and “bear”). To address this issue, we propose a dynamical balancing attention model (BAM) for image captioning, which uses attention variation to fuse channel attention and region attention. When generating function and notional words, it effectively balances the contributions of image channel features and image region features. Further, the proposed approach dynamically focuses on the most relevant attention features during word prediction. Extensive experimental results on typical datasets show that our approach outperforms attention-based approaches and achieves performance competitive with leading end-to-end approaches.