Abstract: Image captioning is an active research topic in computer vision. It is a cross-media data analysis task that combines computer vision and natural language processing: the model describes an image by understanding its content and generating a caption that is both semantically and grammatically correct. Most existing image captioning methods use an encoder-decoder model. When extracting the features of visual objects in an image, these methods mostly ignore the relative positions of the objects, even though those relative positions are very important for generating accurate captions. To address this, this paper proposes a Transformer for image captioning enhanced with spatial encoding and multi-layer joint encoding. To make better use of the position information contained in the image, this paper proposes a spatial encoding mechanism for visual objects, which converts the independent position information of each visual object into relative spatial relationships between visual objects, helping the model recognize how the objects are positioned relative to one another. In addition, in the visual object encoder, the features of the top encoding layer retain more semantic information that fits the image but lose part of the image's visual information. Taking this into account, this paper proposes a multi-level joint encoding mechanism that enriches the semantic information in the top encoding layer by integrating the image feature information contained in each shallower encoding layer, yielding richer semantic features that fit the image. The proposed method is evaluated on the MSCOCO dataset with multiple evaluation metrics (BLEU, METEOR, ROUGE-L, CIDEr, etc.). Ablation experiments show that the proposed spatial encoding mechanism and multi-level joint encoding mechanism both help generate more accurate and effective image captions. Comparative experiments show that the proposed method produces accurate and effective image captions and outperforms most recent methods.
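
The abstract does not give the formula used by the spatial encoding mechanism, so the following is only an illustrative sketch of the general idea of turning per-object box coordinates into pairwise relative features. It assumes boxes given as (x_min, y_min, x_max, y_max) and uses a common relative-geometry formulation (center offsets scaled by box size, plus log size ratios); the function name `relative_spatial_features` and this choice of features are assumptions, not the paper's method.

```python
import math


def relative_spatial_features(boxes):
    """Return a pairwise table feats[i][j] = (dx, dy, dw, dh).

    Hypothetical helper: boxes are (x_min, y_min, x_max, y_max) tuples.
    Each entry describes box j relative to box i, so the features depend
    on how objects are positioned relative to one another rather than on
    where each object sits in the image.
    """
    geometry = []
    for x_min, y_min, x_max, y_max in boxes:
        width = x_max - x_min
        height = y_max - y_min
        if width <= 0 or height <= 0:
            raise ValueError("boxes must have positive width and height")
        # Store center coordinates and size for each box.
        geometry.append((x_min + width / 2, y_min + height / 2, width, height))

    feats = []
    for cx_i, cy_i, w_i, h_i in geometry:
        row = []
        for cx_j, cy_j, w_j, h_j in geometry:
            row.append((
                (cx_j - cx_i) / w_i,   # horizontal offset in units of box i's width
                (cy_j - cy_i) / h_i,   # vertical offset in units of box i's height
                math.log(w_j / w_i),   # log width ratio
                math.log(h_j / h_i),   # log height ratio
            ))
        feats.append(row)
    return feats


if __name__ == "__main__":
    example = [(0, 0, 2, 2), (2, 0, 4, 4)]
    for row in relative_spatial_features(example):
        print(row)
```

In a Transformer encoder, features of this kind are typically projected and added as a bias to the attention scores between object pairs; whether the paper does this is not stated in the abstract.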

Memory Positional Encoding for Image Captioning

Exploring Spatial-Based Position Encoding for Image Captioning

Spatial Encoding and Multi-layer Joint Encoding Enhanced Transformer for Image Captioning

Memory-enhanced Hierarchical Transformer for Video Paragraph Captioning

Conditional Positional Encodings for Vision Transformers

BENet: bi-directional enhanced network for image captioning

Image Captioning with Memorized Knowledge

Memory-Augmented Image Captioning

Rethinking and Improving Relative Position Encoding for Vision Transformer

With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning

Double-Stream Position Learning Transformer Network for Image Captioning

Positional Self-attention Based Hierarchical Image Captioning

Entangled Transformer for Image Captioning

Memory-Attended Recurrent Network For Video Captioning

Enhanced Modality Transition for Image Captioning

Tag‐inferring and tag‐guided Transformer for image captioning

Object Modifier Generation for Image Captioning

A Simple and Effective Positional Encoding for Transformers

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning