Abstract: Automated image caption generation with attention mechanisms focuses on visual features, including the objects, attributes, actions, and scenes of an image, to understand it and produce more detailed captions, a task that has attracted considerable attention in the multimedia field. However, deciding which aspects of an image to highlight for better captioning remains a challenge. Most advanced captioning models use only one attention module to assign attention weights to visual vectors, but this may not be enough to create an informative caption. To tackle this issue, we propose an innovative and well-designed Guided Visual Attention (GVA) approach, which incorporates an additional attention mechanism to re-adjust the attention weights on the visual feature vectors and feeds the resulting context vector to the language LSTM. Using the first-level attention module as guidance for the GVA module and re-weighting the attention weights significantly enhances caption quality. Recently, deep neural networks have allowed the encoder-decoder architecture to make use of visual attention mechanisms, where Faster R-CNN is used to extract features in the encoder and a visual attention-based LSTM is applied in the decoder. Extensive experiments were conducted on both the MS-COCO and Flickr30K benchmark datasets. Compared with state-of-the-art methods, our approach achieved an average improvement of 2.4% on BLEU@1 and 13.24% on CIDEr for the MS-COCO dataset, as well as 4.6% on BLEU@1 and 12.48% on CIDEr for the Flickr30K dataset, based on cross-entropy optimization. These results demonstrate the clear superiority of our proposed approach over existing methods on standard evaluation metrics. The implementation code is available at https://github.com/mdbipu/GVA.
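The two-level re-weighting described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the dot-product scoring, the use of the first-level context vector as the guidance signal, and the names `guided_attention`, `features`, and `hidden` are all assumptions made for illustration. The actual model operates on Faster R-CNN region features with learned projections inside an LSTM decoder.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    # Weighted sum of equal-length vectors.
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def guided_attention(features, hidden):
    """Hypothetical two-level attention over region feature vectors.

    Assumption: both levels use plain dot-product scores; the paper's
    scoring functions and learned parameters are not reproduced here.
    """
    # First level: weight each region by its similarity to the decoder state.
    alpha = softmax([dot(v, hidden) for v in features])
    guide = weighted_sum(alpha, features)
    # Second level: re-score every region against the first-level context,
    # so the first attention pass guides the re-weighting.
    beta = softmax([dot(v, guide) for v in features])
    # Context vector that would be fed to the language LSTM.
    context = weighted_sum(beta, features)
    return alpha, beta, context

# Toy example: three 2-dimensional "region" vectors and a decoder state.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
alpha, beta, context = guided_attention(features, hidden)
```

In this toy input the first pass weights regions 0 and 2 equally, while the second pass shifts weight toward region 2, which best matches the first-level context. This shows only that the second pass can redistribute the weights, not how the trained model behaves.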

Grouped-Attention for Content-Selection and Content-Plan Generation

Sentence Generation for Entity Description with Content-Plan Attention

SAC: Accelerating and Structuring Self-Attention Via Sparse Adaptive Connection

TransCP: A Transformer Pointer Network for Generic Entity Description Generation with Explicit Content-Planning

Neural data-to-text generation with dynamic content planning

Order-Planning Neural Text Generation from Structured Data

Infobox-to-text Generation with Tree-like Planning Based Attention Network

Select and Attend: Towards Controllable Content Selection in Text Generation

GGP: A Graph-based Grouping Planner for Explicit Control of Long Text Generation

Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks

GCP: Graph Encoder with Content-Planning for Sentence Generation from Knowledge Bases

Context-aware graph embedding with gate and attention for session-based recommendation

GVA: guided visual attention approach for automatic image caption generation

A Spatial–Channel–Temporal-Fused Attention for Spiking Neural Networks

A Regularized Framework for Sparse and Structured Neural Attention

Spiking generative adversarial network with attention scoring decoding

Graph-enhanced and collaborative attention networks for session-based recommendation

SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion

Bridge the Gap: High-level Semantic Planning for Image Captioning

Neural Abstractive Summarization with Structural Attention