Abstract: The self-attention mechanism, which has been successfully applied to the current encoder-decoder framework for image captioning, is used to enhance the feature representation in the image encoder and to capture the most relevant information for the language decoder. However, most existing methods assign attention weights to all candidate vectors, which implicitly assumes that all vectors are relevant. Moreover, current self-attention mechanisms ignore the intra-object attention distribution and consider only inter-object relationships. In this paper, we propose a Multi-Gate Attention (MGA) block, which extends traditional self-attention with an additional Attention Weight Gate (AWG) module and a Self-Gated (SG) module. The former constrains the attention weights so that they are assigned to the most contributive objects. The latter accounts for the intra-object attention distribution and eliminates irrelevant information in each object feature vector. Furthermore, most current image captioning methods directly apply the original transformer, which was designed for natural language processing tasks, to refine image features. We therefore propose a pre-layernorm transformer that simplifies the transformer architecture and makes it more efficient for image feature enhancement. By integrating the MGA block with the pre-layernorm transformer architecture into the image encoder and the AWG module into the language decoder, we present a novel Multi-Gate Attention Network (MGAN). Experiments on the MS COCO dataset indicate that MGAN outperforms most state-of-the-art methods, and further experiments on other methods combined with MGA blocks demonstrate the generalizability of our proposal.
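
To make the idea of gating self-attention inside a pre-layernorm block more concrete, the following is a minimal NumPy sketch, not the formulation used in the paper. The abstract does not specify how the AWG or SG modules are computed, so both gates here are assumptions chosen for illustration: the attention weight gate is modeled as keeping only the largest `keep` weights per query and renormalizing, and the self gate is modeled as a per-channel sigmoid computed from the attended vector. All function names, parameter names, and shapes are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize each feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_attention(x, params, keep=2):
    """Single-head self-attention with two illustrative gates (assumed forms)."""
    q, k, v = x @ params["wq"], x @ params["wk"], x @ params["wv"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))

    # Assumed attention weight gate: keep the `keep` largest weights
    # in each row, zero the rest, and renormalize the row to sum to 1.
    kth_largest = np.sort(weights, axis=-1)[:, -keep][:, None]
    gated = np.where(weights >= kth_largest, weights, 0.0)
    gated = gated / gated.sum(axis=-1, keepdims=True)

    attended = gated @ v

    # Assumed self gate: a per-channel sigmoid computed from the attended
    # vector itself, used to damp channels within each object feature.
    gate = sigmoid(attended @ params["wg"])
    return gate * attended, gated


def pre_ln_block(x, params, keep=2):
    # Pre-layernorm: normalize before attention, then add the residual.
    out, weights = gated_attention(layer_norm(x), params, keep)
    return x + out, weights


# Hypothetical input: 5 object features of dimension 8.
rng = np.random.default_rng(0)
dim = 8
params = {name: rng.normal(size=(dim, dim)) * 0.1
          for name in ("wq", "wk", "wv", "wg")}
x = rng.normal(size=(5, dim))
y, w = pre_ln_block(x, params, keep=2)
```

In this sketch each row of `w` has at most `keep` nonzero entries, which is one simple way to stop attention from spreading over every candidate vector. A learned gate could replace the hard top-k rule; the choice here is only for readability.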

LG-MLFormer: Local and Global MLP for Image Captioning

Recurrent convolutional video captioning with global and local attention

Llafn-Generator: Learnable Linear-Attention with Fast-Normalization for Large-Scale Image Captioning

Local-global Visual Interaction Attention for Image Captioning

Image Caption with Global-Local Attention

Image Captioning with Local-Global Visual Interaction Network

Modeling Local and Global Contexts for Image Captioning

Fine-Grained Image Captioning with Global-Local Discriminative Objective

Towards Local Visual Modeling for Image Captioning

Image Captioning Based on Global-Local Feature and Adaptive-Attention

Video Captioning Using Global-Local Representation

GLA: Global–Local Attention for Image Description

Local-to-Global Semantic Supervised Learning for Image Captioning

CA-Captioner: A Novel Concentrated Attention for Image Captioning

Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

A Multi-task Learning Approach for Image Captioning

Multi-Gate Attention Network for Image Captioning

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Constrained LSTM and Residual Attention for Image Captioning

InfMLLM: A Unified Framework for Visual-Language Tasks