Geometry Attention Transformer with Position-aware LSTMs for Image Captioning

Chi Wang, Yulin Shen, Luping Ji
DOI: https://doi.org/10.1016/j.eswa.2022.117174
IF: 8.5
2022-01-01
Expert Systems with Applications
Abstract: In recent years, Transformer architectures have been widely applied to image captioning with impressive performance. However, previous works often neglect the geometric and positional relations among visual objects, which are widely regarded as crucial for good captioning results. To further improve Transformer-based image captioning, this paper proposes an improved Geometry Attention Transformer (GAT) framework. To give the model geometric representation ability, two novel geometry-aware architectures are designed for the encoder and decoder of GAT, respectively: i) a geometry-gate-controlled self-attention refiner, and ii) a group of position-LSTMs. The first explicitly incorporates relative spatial information into the image representations during encoding, and the second precisely informs the decoder of relative word positions when generating caption text. The image representations and spatial information are extracted by a pretrained Faster-RCNN network. An ablation study shows that both modules effectively improve image captioning performance. Experimental comparisons on the MS COCO and Flickr30K datasets also show that GAT often outperforms current state-of-the-art image captioning models.
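The abstract does not give the equations of the geometry-gate-controlled self-attention refiner, so the following is only a minimal NumPy sketch of the general idea: attention weights between detected regions are modulated by a gate computed from their relative box geometry. The relative-geometry features (log-scaled center offsets and size ratios), the sigmoid gate, the renormalization step, and all names (`relative_geometry`, `geometry_gated_attention`, `w_g`) are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def relative_geometry(boxes):
    """Pairwise relative geometry for boxes given as (x_min, y_min, x_max, y_max).

    Returns an (n, n, 4) array. The feature choice is an assumption made for
    this sketch, not taken from the paper.
    """
    x_min, y_min, x_max, y_max = boxes.T
    w = x_max - x_min
    h = y_max - y_min
    cx = x_min + 0.5 * w
    cy = y_min + 0.5 * h
    eps = 1e-3  # floor that keeps log() finite when centers coincide
    dx = np.log(np.maximum(np.abs(cx[:, None] - cx[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(cy[:, None] - cy[None, :]) / h[:, None], eps))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def geometry_gated_attention(feats, boxes, w_q, w_k, w_v, w_g):
    """Single-head self-attention whose weights are scaled by a geometry gate.

    feats: (n, d) region features; boxes: (n, 4); w_q, w_k, w_v: (d, d);
    w_g: (4,) hypothetical gate projection.
    """
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # appearance attention
    gate = 1.0 / (1.0 + np.exp(-(relative_geometry(boxes) @ w_g)))  # (n, n) in (0, 1)
    gated = attn * gate                                   # geometry modulates attention
    gated = gated / gated.sum(axis=-1, keepdims=True)     # renormalize each row
    return gated @ v, gated
```

In a full captioning pipeline, `feats` and `boxes` would come from a region detector such as the pretrained Faster-RCNN mentioned in the abstract; here they can be any arrays of matching shape with positive box widths and heights.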