Abstract: Image captioning is an active research topic in computer vision. It is a cross-media data analysis task that combines computer vision and natural language processing: the model must understand the content of an image and generate a caption that is both semantically and grammatically correct. Most existing image captioning methods use an encoder-decoder model. When extracting visual object features from an image, these methods mostly ignore the relative positions of the visual objects, even though the relative position relationship between objects is very important for generating accurate captions. To address this, this paper proposes a spatial encoding and multi-layer joint encoding enhanced transformer for image captioning. To make better use of the position information contained in the image, the paper proposes a spatial encoding mechanism for visual objects, which converts each visual object's independent position information into relative spatial relationships between visual objects and so helps the model recognize how the objects are positioned with respect to one another. In addition, in the visual object encoder, the top-layer encoding feature retains more semantic information that fits the image but loses part of the image's visual information. Taking this into account, the paper proposes a multi-layer joint encoding mechanism that enriches the semantic information in the top encoding layer by integrating the image feature information contained in each shallower encoding layer, yielding richer semantic features that fit the image. The proposed method is evaluated on the MSCOCO dataset with multiple evaluation metrics (BLEU, METEOR, ROUGE-L, CIDEr, etc.). Ablation experiments show that the spatial encoding mechanism and the multi-layer joint encoding mechanism both help generate more accurate and effective image captions. Comparative experiments show that the proposed method produces accurate and effective image captions and outperforms most recent methods.
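
The idea of turning per-object positions into relative spatial relationships can be illustrated with a small sketch. The abstract does not give the paper's actual formulation, so the code below is only an assumption: it uses a log-ratio box geometry of the kind common in relation-aware attention models, and the function names, the `(x_min, y_min, x_max, y_max)` box format, and the `eps` floor are all hypothetical.

```python
import math


def relative_geometry(box_i, box_j, eps=1e-3):
    """Return four relative-geometry features of box_j as seen from box_i.

    Hypothetical helper, not the paper's formulation. Boxes are
    (x_min, y_min, x_max, y_max) and must have positive width and height.
    """
    # Centers and sizes of both boxes.
    cx_i, cy_i = (box_i[0] + box_i[2]) / 2, (box_i[1] + box_i[3]) / 2
    cx_j, cy_j = (box_j[0] + box_j[2]) / 2, (box_j[1] + box_j[3]) / 2
    w_i, h_i = box_i[2] - box_i[0], box_i[3] - box_i[1]
    w_j, h_j = box_j[2] - box_j[0], box_j[3] - box_j[1]

    # Center offsets are normalized by the size of box_i; eps keeps the
    # logarithm finite when two centers coincide along an axis.
    return (
        math.log(max(abs(cx_i - cx_j), eps) / w_i),
        math.log(max(abs(cy_i - cy_j), eps) / h_i),
        math.log(w_j / w_i),
        math.log(h_j / h_i),
    )


def pairwise_relative_geometry(boxes):
    """Build an N x N grid of relative-geometry tuples for N boxes."""
    return [[relative_geometry(b_i, b_j) for b_j in boxes] for b_i in boxes]


# Two equally sized objects side by side on the same horizontal line.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10)]
relations = pairwise_relative_geometry(boxes)
```

In a transformer encoder, such pairwise features would typically be embedded and used to adjust the attention weights between object pairs; whether the paper does exactly this is not stated in the abstract.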

Image Caption Method from Coarse to Fine Based on Dual Encoder-Decoder Framework

Exploring Spatial-Based Position Encoding for Image Captioning

A Denoising Framework for Image Caption

Fine-Grained Features for Image Captioning

An image caption model based on attention mechanism and deep reinforcement learning

Image Caption Generation via Unified Retrieval and Generation-Based Method

Incorporating retrieval-based method for feature enhanced image captioning

Remote Sensing Image Captioning Based on Multi-Level Feature Extraction and Adaptive Attention

An Image Captioning Algorithm Based on Combination Attention Mechanism

Spatial Encoding and Multi-layer Joint Encoding Enhanced Transformer for Image Captioning

DFEN: Dual Feature Enhancement Network for Remote Sensing Image Caption

Fine-Grained Image Captioning with Global-Local Discriminative Objective

BENet: bi-directional enhanced network for image captioning

TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning

Sequential Dual Attention: Coarse-to-Fine-Grained Hierarchical Generation for Image Captioning

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

Exploring refined dual visual features cross-combination for image captioning

OSIC: A New One-Stage Image Captioner Coined

Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning

Chinese image captioning with fusion encoder and visual keyword search

CA-Captioner: A Novel Concentrated Attention for Image Captioning