Abstract:With the development of artificial intelligence and deep learning technologies, image captioning has become an important research direction at the intersection of computer vision and natural language processing. The purpose of image captioning is to generate corresponding natural language descriptions by understanding the content of images. This technology has broad application prospects in fields such as image retrieval, autonomous driving, and visual question answering. Currently, many researchers have proposed region-based image captioning methods. These methods generate captions by extracting features from different regions of an image. However, they often rely on local features of the image and overlook the understanding of the overall scene, leading to captions that lack coherence and accuracy when dealing with complex scenes. Additionally, image captioning methods are unable to extract complete semantic information from visual data, which may lead to captions with biases and deficiencies. Due to these reasons, existing methods struggle to generate comprehensive and accurate captions. To fill this gap, we propose the Semantic Scenes Encoder (SSE) for image captioning. It first extracts a scene graph from the image and integrates it into the encoding of the image information. Then, it extracts a semantic graph from the captions and preserves semantic information through a learnable attention mechanism, which we refer to as the dictionary. During the generation of captions, it combines the encoded information of the image and the learned semantic information to generate complete and accurate captions. To verify the effectiveness of the SSE, we tested the model on the MSCOCO dataset. The experimental results show that the SSE improves the overall quality of the captions. The improvement in scores across multiple evaluation metrics further demonstrates that the SSE possesses significant advantages when processing identical images.

DP-RSCAP: Dual Prompt-Based Scene and Entity Network for Remote Sensing Image Captioning

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning

Discrete diffusion models with Refined Language–Image Pre-trained representations for remote sensing image captioning

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Progressive Scale-aware Network for Remote sensing Image Change Captioning

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

Single-Stream Extractor Network With Contrastive Pre-Training for Remote-Sensing Change Captioning

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

Dense captioning and multidimensional evaluations for indoor robotic scenes

Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection.

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Region-Focused Network for Dense Captioning

Image Captioning Based on Semantic Scenes

Exploring Models and Data for Remote Sensing Image Caption Generation

Changes to Captions: An Attentive Network for Remote Sensing Change Captioning