Abstract: Recently, a series of studies have incorporated spatial attention mechanisms into image captioning, achieving a remarkable improvement in the quality of generated captions. However, the traditional spatial attention mechanism relies on latent and delayed semantic representations to decide which regions deserve more attention, which results in inaccurate semantic guidance and introduces redundant information. To improve the spatial attention mechanism, we propose the Semantic Guidance Attention (SGA) mechanism in this article. Specifically, SGA uses semantic word representations to provide intuitive semantic guidance that focuses accurately on semantically related regions. Moreover, we reduce the difficulty of generating fluent sentences by updating the attention information in a timely manner. In addition, the beam search algorithm is widely used to predict words during sequence generation. Because this algorithm generates a sentence according to word probabilities, it tends to produce generic sentences and discard more distinctive captions. To overcome this limitation, we design the Consensus Selection (CS) strategy, which chooses the most descriptive and informative caption based on the semantic similarity of captions rather than the probabilities of words. The consensus caption is the one with the highest cumulative semantic similarity with respect to the reference captions. Our proposed model (SGA-CS) is validated on Flickr30k and MSCOCO, where it outperforms state-of-the-art approaches. To the best of our knowledge, SGA-CS is the first attempt to jointly produce semantic attention guidance and select descriptive captions for image captioning, and it achieves one of the best performances among methods trained with cross-entropy loss.
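
The selection rule described in the abstract (pick the candidate caption with the highest cumulative semantic similarity to the reference captions) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or how reference captions are obtained, so the sketch assumes a simple bag-of-words cosine similarity as a stand-in and takes the references as given input. The names `cosine_bow` and `consensus_select` are hypothetical.

```python
import math
from collections import Counter


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors.

    Assumption: a stand-in for the unspecified semantic similarity measure.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consensus_select(candidates, references, similarity=cosine_bow):
    """Return the candidate with the highest cumulative similarity to the
    references, together with the per-candidate cumulative scores."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    # Cumulative score: sum of similarities to every reference caption.
    scores = [sum(similarity(c, r) for r in references) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


if __name__ == "__main__":
    # Hypothetical candidates (e.g. from beam search) and reference captions.
    candidates = [
        "a man is standing",
        "a man rides a brown horse on a beach",
    ]
    references = [
        "a man riding a horse on the beach",
        "a person rides a horse along the shore",
    ]
    caption, scores = consensus_select(candidates, references)
    print(caption)
```

In this toy input the more specific second candidate shares more words with both references, so it receives the higher cumulative score even if a decoder assigned the generic first candidate a higher word-probability score. Any other sentence-level similarity function with the same two-string signature could be passed as `similarity`.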

Learning Consensus-Aware Semantic Knowledge for Remote Sensing Image Captioning

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Remote Sensing Image Captioning with Sequential Attention and Flexible Word Correlation

Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning

Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning

Semantic-Spatial Collaborative Perception Network for Remote Sensing Image Captioning

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Remote Sensing Image Captioning Based on Multi-Level Feature Extraction and Adaptive Attention

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

Exploring Models and Data for Remote Sensing Image Caption Generation

High-Resolution Remote Sensing Image Captioning Based on Structured Attention

A Review of Deep Learning-Based Remote Sensing Image Caption: Methods, Models, Comparisons and Future Directions

Semantic Descriptions of High-Resolution Remote Sensing Images

Semantic-Guided Selective Representation for Image Captioning

Multi-label Semantic Feature Fusion for Remote Sensing Image Captioning

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

Image Captioning Via Semantic Guidance Attention and Consensus Selection Strategy

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

Word–Sentence Framework for Remote Sensing Image Captioning