Abstract: Image captioning is an active research topic in computer vision. It is a cross-media data analysis task that combines computer vision and natural language processing: the model must understand the content of an image and generate a caption that is both semantically and grammatically correct. Most existing image captioning methods use an encoder-decoder model. When extracting visual object features from an image, these methods mostly ignore the relative positions of the visual objects, even though the relative position relationship between objects is very important for generating accurate captions. To address this, this paper proposes a spatial encoding and multi-layer joint encoding enhanced Transformer for image captioning. To make better use of the position information contained in the image, this paper proposes a spatial encoding mechanism for visual objects, which converts the independent spatial information of each visual object into relative spatial relationships between visual objects, helping the model recognize how the objects are positioned relative to one another. In addition, in the visual object encoder, the top-layer encoding feature retains more semantic information that fits the image but loses part of the image's visual information. Taking this into account, this paper proposes a multi-level joint encoding mechanism that enriches the semantic information in the top encoding layer by integrating the image feature information contained in each shallower encoding layer, yielding richer semantic features that fit the image. The proposed method is evaluated on the MSCOCO dataset using multiple evaluation metrics (BLEU, METEOR, ROUGE-L, CIDEr, etc.). Ablation experiments show that the proposed spatial encoding mechanism and multi-level joint encoding mechanism both help generate more accurate and effective image captions. Comparative experiments show that the proposed method produces accurate and effective image captions and outperforms most of the latest methods.
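
To make the idea of converting per-object positions into relative positions concrete, the following is a minimal sketch and not the paper's actual formulation, which the abstract does not specify. It assumes each visual object comes with a bounding box `(x_min, y_min, x_max, y_max)` and uses a log-ratio pairwise geometry feature of the kind common in the object-relation literature; the function name `relative_box_features` and the clipping constant `eps` are hypothetical.

```python
import math

def relative_box_features(boxes, eps=1e-3):
    """Pairwise relative geometry between boxes (illustrative assumption).

    boxes: list of (x_min, y_min, x_max, y_max) tuples.
    Returns feats[i][j] = (dx, dy, dw, dh), where each entry describes
    box j relative to box i on a log scale.
    """
    feats = []
    for (x0, y0, x1, y1) in boxes:
        cx_i, cy_i = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        w_i, h_i = x1 - x0, y1 - y0
        row = []
        for (u0, v0, u1, v1) in boxes:
            cx_j, cy_j = (u0 + u1) / 2.0, (v0 + v1) / 2.0
            w_j, h_j = u1 - u0, v1 - v0
            row.append((
                # Centre offsets, normalised by box i's size; clipped at eps
                # so that log(0) is avoided when centres coincide.
                math.log(max(abs(cx_j - cx_i) / w_i, eps)),
                math.log(max(abs(cy_j - cy_i) / h_i, eps)),
                # Size ratios between the two boxes.
                math.log(w_j / w_i),
                math.log(h_j / h_i),
            ))
        feats.append(row)
    return feats

# Two example boxes: the second is twice as wide and lies to the right.
boxes = [(0, 0, 10, 10), (20, 0, 40, 10)]
rel = relative_box_features(boxes)
```

In a Transformer encoder, features like these would typically be projected and added as a bias to the attention scores between object pairs; whether the paper does exactly this is not stated in the abstract.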

SPT: Spatial Pyramid Transformer for Image Captioning

S2 Transformer for Image Captioning

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Exploring Spatial-Based Position Encoding for Image Captioning

SDPT: Semantic-Aware Dimension-Pooling Transformer for Image Segmentation

Progressive Tree-Structured Prototype Network for End-to-End Image Captioning

A Position-Aware Transformer for Image Captioning

Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning

Prior Knowledge-Guided Transformer for Remote Sensing Image Captioning

Tag‐inferring and tag‐guided Transformer for image captioning

Spatial Encoding and Multi-layer Joint Encoding Enhanced Transformer for Image Captioning

End-to-End Transformer Based Model for Image Captioning

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

Image Captioning Based on An Improved Transformer with IoU Position Encoding

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer

Diverse Image Captioning Via Panoptic Segmentation and Sequential Conditional Variational Transformer

Double-Stream Position Learning Transformer Network for Image Captioning

Exploring Visual Relationships Via Transformer-based Graphs for Enhanced Image Captioning

Transformer with multi-level grid features and depth pooling for image captioning

Captioning Transformer with Stacked Attention Modules