Abstract: With advances in artificial intelligence and deep learning, image captioning has become an important research direction at the intersection of computer vision and natural language processing. Image captioning aims to generate a natural language description of an image by understanding its content, and it has broad application prospects in fields such as image retrieval, autonomous driving, and visual question answering. Many researchers have proposed region-based image captioning methods, which generate captions by extracting features from different regions of an image. However, these methods often rely on local image features and overlook the overall scene, so their captions lack coherence and accuracy for complex scenes. In addition, existing captioning methods cannot extract complete semantic information from visual data, which may lead to biased or incomplete captions. As a result, existing methods struggle to generate comprehensive and accurate captions. To fill this gap, we propose the Semantic Scenes Encoder (SSE) for image captioning. It first extracts a scene graph from the image and integrates it into the encoding of the image information. It then extracts a semantic graph from the captions and preserves the semantic information through a learnable attention mechanism, which we refer to as the dictionary. During caption generation, it combines the encoded image information with the learned semantic information to generate complete and accurate captions. To verify the effectiveness of the SSE, we evaluated the model on the MSCOCO dataset. The experimental results show that the SSE improves the overall quality of the captions, and the improvement in scores across multiple evaluation metrics further demonstrates that the SSE has significant advantages when captioning the same images.
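
As a rough illustration of the "dictionary" idea described in the abstract, the sketch below attends from one encoded image feature over a small set of memory slots and fuses the result with the image feature. It is a minimal sketch under stated assumptions, not the SSE implementation: the function names (`dictionary_attention`, `fuse`), the scaled dot-product scoring, the concatenation-based fusion, and the randomly initialised slots are all hypothetical choices. In a real model the slots would be trained parameters learned from the caption semantic graphs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dictionary_attention(query, dictionary):
    """Attend from one image feature (query) over the dictionary slots.

    Assumption: scaled dot-product attention; the abstract only says the
    dictionary is a learnable attention mechanism.
    """
    dim = len(query)
    scores = [dot(query, slot) / math.sqrt(dim) for slot in dictionary]
    weights = softmax(scores)
    # Weighted sum of slots gives the retrieved semantic context vector.
    context = [
        sum(w * slot[i] for w, slot in zip(weights, dictionary))
        for i in range(dim)
    ]
    return weights, context


def fuse(image_feature, context):
    """Hypothetical fusion: concatenate image and semantic vectors."""
    return image_feature + context


rng = random.Random(0)
dim, num_slots = 4, 3
# Stand-in for learned parameters: random values, NOT trained slots.
dictionary = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_slots)]
# Stand-in for a scene-graph-aware image encoding.
image_feature = [rng.uniform(-1, 1) for _ in range(dim)]

weights, context = dictionary_attention(image_feature, dictionary)
decoder_input = fuse(image_feature, context)
```

Here `decoder_input` stands in for what a caption decoder would consume at each step: the image encoding together with semantic information retrieved from the dictionary.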

Image Captioning with Semantic-Enhanced Features and Extremely Hard Negative Examples

Negative-Sensitive Framework With Semantic Enhancement for Composed Image Retrieval

Image Captioning with Bidirectional Semantic Attention-Based Guiding of Long Short-Term Memory

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

Improved Image Captioning via Semantic Feature Update

Improving Image Captioning with Better Use of Captions

Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning

More is Better: Precise and Detailed Image Captioning Using Online Positive Recall and Missing Concepts Mining

Image Captioning with Visual-Semantic Double Attention

A Novel Image Captioning Model with Visual-Semantic Similarities and Visual Representations Re-Weighting

A Novel Semantic Attribute-Based Feature for Image Caption Generation

Object Semantic Analysis for Image Captioning

Contour-Augmented Concept Prediction Network for Image Captioning

Semantic Descriptions of High-Resolution Remote Sensing Images

Unpaired Image Captioning With Semantic-Constrained Self-Learning

Boosting convolutional image captioning with semantic content and visual relationship

End-to-End Non-Autoregressive Image Captioning

ITContrast: contrastive learning with hard negative synthesis for image-text matching

Image Captioning Based on Semantic Scenes

Incorporating retrieval-based method for feature enhanced image captioning