Abstract: Recently, a series of attempts have incorporated spatial attention mechanisms into image captioning, achieving a remarkable improvement in the quality of generated captions. However, the traditional spatial attention mechanism relies on latent and delayed semantic representations to decide which area deserves more attention, resulting in inaccurate semantic guidance and the introduction of redundant information. To optimize the spatial attention mechanism, we propose the Semantic Guidance Attention (SGA) mechanism in this article. Specifically, SGA uses semantic word representations to provide intuitive semantic guidance that focuses accurately on semantically related regions. Moreover, we reduce the difficulty of generating fluent sentences by updating the attention information in a timely manner. Meanwhile, the beam search algorithm is widely used to predict words during sequence generation. Because this algorithm generates a sentence according to the probabilities of words, it tends to produce generic sentences and discard more distinctive captions. To overcome this limitation, we design the Consensus Selection (CS) strategy, which chooses the most descriptive and informative caption based on the semantic similarity of captions instead of the probabilities of words. The consensus caption is the candidate with the highest cumulative semantic similarity with respect to the reference captions. Experiments on Flickr30k and MSCOCO show that our proposed model (SGA-CS) outperforms state-of-the-art approaches. To the best of our knowledge, SGA-CS is the first attempt to jointly produce semantic attention guidance and select descriptive captions for image captioning, and it achieves some of the best results among methods trained with cross-entropy.
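
The selection rule in the Consensus Selection strategy (pick the candidate with the highest cumulative similarity to the reference captions) can be illustrated with a minimal sketch. The abstract does not say which semantic similarity measure is used or how the reference captions are obtained, so the bag-of-words cosine similarity below is only a stand-in assumption, and the names `bow_cosine` and `consensus_select` are hypothetical.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors.

    Assumption: a simple stand-in for the semantic similarity
    measure, which the abstract does not specify.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consensus_select(candidates, references, similarity=bow_cosine):
    """Return the candidate with the highest cumulative similarity
    to the reference captions, together with all cumulative scores."""
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    # Cumulative score: sum of similarities to every reference caption.
    scores = [sum(similarity(c, r) for r in references) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Illustrative data only, not taken from the paper.
candidates = [
    "a man is standing",                     # generic caption
    "a man rides a brown horse on a beach",  # more descriptive caption
]
references = [
    "a man riding a horse on the beach",
    "a person rides a horse along a beach",
]
best_caption, scores = consensus_select(candidates, references)
```

In this toy case the more descriptive candidate overlaps more with both references and is therefore selected, regardless of how probable a decoder would consider each sentence. Any other similarity function with the same two-string signature can be passed through the `similarity` parameter.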

Stimulus-driven and Concept-Driven Analysis for Image Caption Generation

Visual Attention Based on Long-Short Term Memory Model for Image Caption Generation

Image Caption Generation with High-Level Image Features

An Image Captioning Algorithm Based on Combination Attention Mechanism

Image Caption Description of Traffic Scene Based on Deep Learning

Image Captioning Based on Sentence-Level and Word-Level Attention

Scene Attention Mechanism For Remote Sensing Image Caption Generation

Image Captioning with a Joint Attention Mechanism by Visual Concept Samples

Looking Deeper and Transferring Attention for Image Captioning

Combining Object-Based Attention And Attributes For Image Captioning

Image Caption with Endogenous–Exogenous Attention

Object-aware Semantics of Attention for Image Captioning

Topic-Guided Attention for Image Captioning

Image Caption with Synchronous Cross-Attention

Time-Dependent Pre-Attention Model For Image Captioning

Contextual and Selective Attention Networks for Image Captioning

Image Captioning Via Semantic Guidance Attention and Consensus Selection Strategy

Adaptive Syncretic Attention for Constrained Image Captioning

Improving Image Captioning through Visual and Semantic Mutual Promotion

GateCap: Gated Spatial and Semantic Attention Model for Image Captioning