Abstract: Recently, a series of attempts have incorporated spatial attention mechanisms into image captioning, achieving a remarkable improvement in the quality of generated captions. However, the traditional spatial attention mechanism relies on latent and delayed semantic representations to decide which regions deserve more attention, resulting in inaccurate semantic guidance and the introduction of redundant information. To optimize the spatial attention mechanism, we propose the Semantic Guidance Attention (SGA) mechanism in this article. Specifically, SGA uses semantic word representations to provide intuitive semantic guidance that focuses accurately on semantically related regions. Moreover, we reduce the difficulty of generating fluent sentences by updating the attention information in a timely manner. Separately, the beam search algorithm is widely used to predict words during sequence generation. Because this algorithm generates a sentence according to word probabilities, it tends to produce a generic sentence and discard more distinctive captions. To overcome this limitation, we design the Consensus Selection (CS) strategy, which chooses the most descriptive and informative caption based on the semantic similarity of captions rather than the probabilities of words. The consensus caption is the one with the highest cumulative semantic similarity with respect to the reference captions. Our proposed model (SGA-CS) is validated on Flickr30k and MSCOCO, where it outperforms state-of-the-art approaches. To the best of our knowledge, SGA-CS is the first attempt to jointly produce semantic attention guidance and select descriptive captions for image captioning, achieving one of the best performances among methods trained with cross-entropy.
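
The selection rule described in the abstract (pick the candidate caption with the highest cumulative semantic similarity to the reference captions) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or how candidates and references are obtained, so everything here is an assumption for illustration: the function names `bow_cosine` and `consensus_select` are hypothetical, and a simple bag-of-words cosine similarity stands in for whatever semantic similarity the authors actually use.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in, NOT the paper's measure)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def consensus_select(
    candidates: Sequence[str],
    references: Sequence[str],
    similarity: Callable[[str, str], float] = bow_cosine,
) -> str:
    """Return the candidate with the highest summed similarity to the references.

    Hypothetical helper: candidates might come from beam search, and the
    references are assumed to be given; neither detail is from the abstract.
    """
    if not candidates:
        raise ValueError("need at least one candidate caption")
    # Cumulative similarity: sum over all reference captions.
    return max(candidates, key=lambda c: sum(similarity(c, r) for r in references))


if __name__ == "__main__":
    candidates = ["a dog", "a brown dog runs on grass", "a cat sits"]
    references = [
        "a brown dog is running on the grass",
        "brown dog runs across a grassy field",
    ]
    print(consensus_select(candidates, references))
```

In this toy run the short, generic candidate "a dog" loses to the more specific one because the latter shares more content with both references, which is the behaviour the abstract attributes to selecting by similarity rather than by word probability. Any other similarity function with the same two-string signature could be passed in through the `similarity` parameter.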

Image Captioning with Affective Guiding and Selective Attention

Image Captioning by Incorporating Affective Concepts Learned from Both Visual and Textual Components

Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions

Image Captioning using Facial Expression and Attention

Senti-Attend: Image Captioning using Sentiment and Attention

Automatic Image Description Generation with Emotional Classifiers

Visual Attention Based on Long-Short Term Memory Model for Image Caption Generation

Emotional Video Captioning With Vision-Based Emotion Interpretation Network

Fine-grained image emotion captioning based on Generative Adversarial Networks

SentiCap: Generating Image Descriptions with Sentiments

Reference Based On Adaptive Attention Mechanism For Image Captioning

Image Captioning Via Semantic Guidance Attention and Consensus Selection Strategy

An Image Captioning Approach Using Dynamical Attention

Adaptive Syncretic Attention for Constrained Image Captioning

Image Captioning with Inherent Sentiment

Image Captioning with Bidirectional Semantic Attention-Based Guiding of Long Short-Term Memory

An Image Captioning Algorithm Based on Combination Attention Mechanism

Image Captioning with Visual-Semantic Double Attention

Towards Personalized Aesthetic Image Caption

Affective image classification by jointly using interpretable art features and semantic annotations

A Multi-Level Attention Model For Remote Sensing Image Captions