Image Captioning Based on An Improved Transformer with IoU Position Encoding

Yazhou Li, Yihui Shi, Yun Liu, Ruifan Li, Zhanyu Ma
2021-01-01
Abstract: Image captioning aims to automatically generate a descriptive sentence for a given image. Most existing works use a recurrent neural network as the language decoder. In this paper, we use a Transformer structure to generate descriptive captions. When applied to image captioning, the Transformer network has two problems. The first is the loss of query vector information as layers are stacked. The second is the lack of spatial information between objects in the decoding process. To solve these problems, we propose an improved Transformer with IoU Position encoding model, i.e., TIP. We improve the Transformer in two respects. First, we propose an intra-modal attention mechanism to alleviate the problem of vanishing query vectors. Second, we propose an Intersection-over-Union (IoU) spatial position encoding method to enhance the semantic information of images. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our model.
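For readers unfamiliar with the quantity behind the position encoding, the following is a minimal sketch of how Intersection-over-Union between detected object boxes is commonly computed. It is not the paper's implementation: the function names `iou` and `pairwise_iou` and the `(x1, y1, x2, y2)` corner box format are assumptions made for illustration, and the abstract does not say how the resulting values are fed into the attention layers.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give zero intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Guard against degenerate zero-area boxes.
    return inter / union if union > 0 else 0.0


def pairwise_iou(boxes):
    """Symmetric N x N matrix of IoU values between all pairs of boxes."""
    return [[iou(a, b) for b in boxes] for a in boxes]
```

A matrix of this kind gives one overlap score per pair of detected regions, which is one plausible way to expose spatial relations between objects to a decoder.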