A Dual Self-Attention based Network for Image Captioning

ZhiYong Li, JinFu Yang, YaPing Li
DOI: https://doi.org/10.1109/CCDC52312.2021.9602488
2021-01-01
Abstract: Image captioning has become an important means for intelligent robots to understand image content. Extracting image information effectively is the key to generating accurate and reliable captions. In this paper, we propose a dual self-attention based network (DSAN) for image captioning. Specifically, we design a Dual Self-Attention Module (DSAM), embedded in an encoder-decoder architecture, that captures contextual information in the image and adaptively integrates local features with global dependencies. By modeling rich contextual dependencies over local features, the DSAM significantly improves caption quality. Experimental results on the MS COCO dataset show that the proposed DSAN achieves better performance than existing methods.
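The abstract does not spell out the internals of the DSAM, so the following is only a minimal sketch of how self-attention can combine local features with global dependencies. It assumes the two branches are a spatial (position) branch and a channel branch fused by a residual sum; that split, the scaling factors, and all function names (`position_attention`, `channel_attention`, `dual_self_attention`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(feats):
    # feats: (N, C) array of N local (spatial) features with C channels.
    # Each location attends to every other location (assumed branch 1).
    scores = feats @ feats.T / np.sqrt(feats.shape[1])   # (N, N)
    weights = softmax(scores, axis=-1)
    return weights @ feats                               # (N, C)

def channel_attention(feats):
    # Each channel attends to every other channel (assumed branch 2).
    scores = feats.T @ feats / np.sqrt(feats.shape[0])   # (C, C)
    weights = softmax(scores, axis=-1)
    return feats @ weights.T                             # (N, C)

def dual_self_attention(feats):
    # Assumed fusion: residual sum of the local features and both branches.
    return feats + position_attention(feats) + channel_attention(feats)

# Hypothetical input: a 7x7 grid of 8-dimensional CNN features, flattened.
rng = np.random.default_rng(0)
local_feats = rng.standard_normal((49, 8))
context_feats = dual_self_attention(local_feats)
```

In an encoder-decoder captioner, features like `context_feats` would be passed to the caption decoder in place of the raw local features; the output keeps the input shape, so the module can be inserted without changing the surrounding layers.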