Abstract: Recently, a series of works have incorporated spatial attention mechanisms into image captioning, achieving a remarkable improvement in the quality of generated captions. However, the traditional spatial attention mechanism relies on latent and delayed semantic representations to decide which regions deserve more attention, which results in inaccurate semantic guidance and introduces redundant information. To improve the spatial attention mechanism, we propose the Semantic Guidance Attention (SGA) mechanism in this article. Specifically, SGA uses semantic word representations to provide intuitive semantic guidance that focuses accurately on semantically related regions. Moreover, we reduce the difficulty of generating fluent sentences by updating the attention information in a timely manner. Meanwhile, the beam search algorithm is widely used to predict words during sequence generation. Because this algorithm generates a sentence according to word probabilities, it tends to produce a generic sentence and discard more distinctive captions. To overcome this limitation, we design the Consensus Selection (CS) strategy, which chooses the most descriptive and informative caption based on the semantic similarity of captions rather than the probabilities of words. The consensus caption is the candidate with the highest cumulative semantic similarity with respect to the reference captions. We validate the proposed model (SGA-CS) on Flickr30k and MSCOCO, and the results show that SGA-CS outperforms state-of-the-art approaches. To the best of our knowledge, SGA-CS is the first attempt to jointly produce semantic attention guidance and select descriptive captions for image captioning, and it achieves one of the best performances among methods trained with cross-entropy loss.
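
The consensus rule stated in the abstract (pick the candidate with the highest cumulative semantic similarity to the reference captions) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or how the reference captions are obtained, so everything concrete here is an assumption: `bow_cosine` is a simple bag-of-words cosine used as a stand-in metric, and `consensus_select`, the candidate list, and the reference list are hypothetical names and data chosen for illustration only.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors.

    Assumption: a stand-in for the paper's semantic similarity,
    which the abstract does not define.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def consensus_select(candidates, references, similarity=bow_cosine):
    """Return the candidate with the highest cumulative similarity to the references."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda c: sum(similarity(c, r) for r in references))


# Hypothetical candidates (e.g. the final beam) and reference captions.
candidates = [
    "a man is standing",
    "a man rides a brown horse on a beach",
]
references = [
    "a man riding a horse along the beach",
    "person on a brown horse near the ocean",
]
best = consensus_select(candidates, references)
print(best)
```

In this toy case the shorter, more generic candidate overlaps less with the references, so the more specific caption is selected. Any other sentence-level similarity function with the same two-argument signature could be passed as `similarity`.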

Variational Joint Self‐attention for Image Captioning

Visual Attention Based on Long-Short Term Memory Model for Image Caption Generation

Image Captioning with a Joint Attention Mechanism by Visual Concept Samples

Image Captioning with Visual-Semantic Double Attention

Improve Image Captioning By Self-Attention

Image Captioning Via Semantic Guidance Attention and Consensus Selection Strategy

VD-SAN: Visual-Densely Semantic Attention Network for Image Caption Generation

Image Captioning Based on Sentence-Level and Word-Level Attention

Image Captioning Algorithm Based on Sufficient Visual Information and Text Information

Adaptive Syncretic Attention for Constrained Image Captioning

Learning joint relationship attention network for image captioning

Improving Image Captioning through Visual and Semantic Mutual Promotion

A Cooperative Approach Based on Self-Attention with Interactive Attribute for Image Caption

Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning

Image captioning with weakly-supervised attention penalty

Variational Structured Semantic Inference for Diverse Image Captioning

Image Captioning with Bidirectional Semantic Attention-Based Guiding of Long Short-Term Memory

Image Captioning with Text-Based Visual Attention

GVA: guided visual attention approach for automatic image caption generation

Image Caption with Synchronous Cross-Attention