Abstract: Image captioning is an active research topic in computer vision. It is a cross-media data analysis task that combines computer vision and natural language processing: the model must understand the content of an image and generate a caption that is both semantically and grammatically correct. Most existing image captioning methods use an encoder-decoder model. When extracting visual object features from an image, these methods mostly ignore the relative positions of the visual objects, even though these relative positions are very important for generating accurate captions. To address this, this paper proposes a transformer for image captioning enhanced with spatial encoding and multi-layer joint encoding. To make better use of the position information contained in the image, it introduces a spatial encoding mechanism for visual objects that converts the independent spatial information of each visual object into relative spatial relationships between visual objects, helping the model recognize how the objects are positioned relative to one another. In addition, in the visual object encoder, the top-layer encoding features retain more semantic information that fits the image but lose part of the image's visual information. To account for this, the paper proposes a multi-layer joint encoding mechanism that enriches the semantic information in the top encoding layer by integrating the image feature information contained in each shallow encoding layer, yielding richer semantic features that fit the image. The proposed method is evaluated on the MSCOCO dataset with multiple evaluation metrics (BLEU, METEOR, ROUGE-L, CIDEr, etc.). Ablation experiments show that the proposed spatial encoding mechanism and multi-layer joint encoding mechanism both help generate more accurate and effective image captions. Comparative experiments show that the proposed method produces accurate and effective image captions and outperforms most of the latest methods.
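
The abstract does not give formulas for either mechanism, so the following is only a minimal sketch of the two ideas under stated assumptions, not the paper's implementation. The function names `relative_box_features` and `fuse_layers` are hypothetical. For the spatial encoding, the sketch assumes the log-scaled relative box geometry commonly used in relation-aware attention; for the joint encoding, it assumes a simple weighted average over the outputs of all encoder layers.

```python
import math


def relative_box_features(boxes, eps=1e-3):
    """Turn per-object boxes into pairwise relative geometry.

    boxes: list of (x_min, y_min, x_max, y_max) with positive width and height.
    Returns feats where feats[a][b] is a 4-tuple describing box b relative to box a.
    The log-scaled form is an assumption, not taken from the abstract.
    """
    feats = []
    for ax0, ay0, ax1, ay1 in boxes:
        acx, acy = (ax0 + ax1) / 2, (ay0 + ay1) / 2
        aw, ah = ax1 - ax0, ay1 - ay0
        row = []
        for bx0, by0, bx1, by1 in boxes:
            bcx, bcy = (bx0 + bx1) / 2, (by0 + by1) / 2
            bw, bh = bx1 - bx0, by1 - by0
            row.append((
                # Center offsets, normalized by the reference box size;
                # eps keeps the log finite when the centers coincide.
                math.log(max(abs(bcx - acx) / aw, eps)),
                math.log(max(abs(bcy - acy) / ah, eps)),
                # Relative width and height.
                math.log(bw / aw),
                math.log(bh / ah),
            ))
        feats.append(row)
    return feats


def fuse_layers(layer_outputs, weights=None):
    """Combine the outputs of all encoder layers into one feature set.

    layer_outputs: list over layers; each layer is a list of per-object vectors.
    weights: one scalar per layer (hypothetical; uniform if omitted).
    A weighted average is assumed here purely for illustration.
    """
    num_layers = len(layer_outputs)
    if weights is None:
        weights = [1.0 / num_layers] * num_layers
    num_objects = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    fused = [[0.0] * dim for _ in range(num_objects)]
    for w, layer in zip(weights, layer_outputs):
        for i in range(num_objects):
            for d in range(dim):
                fused[i][d] += w * layer[i][d]
    return fused
```

In a transformer encoder, pairwise features like these would typically be projected to a scalar per object pair and added to the attention logits, while the fused features would replace or augment the top-layer output passed to the decoder. Both of these wiring choices are likewise assumptions.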

Image Captioning Based on An Improved Transformer with IoU Position Encoding

Spatial Encoding and Multi-layer Joint Encoding Enhanced Transformer for Image Captioning

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

Tag‐inferring and tag‐guided Transformer for image captioning

Exploring Visual Relationships Via Transformer-based Graphs for Enhanced Image Captioning

Image Captioning: Transforming Objects into Words

Entangled Transformer for Image Captioning

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

Caption TLSTMs: Combining Transformer with LSTMs for Image Captioning

BENet: bi-directional enhanced network for image captioning

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Improved image captioning with subword units training and transformer

Improving OCR-based Image Captioning by Incorporating Geometrical Relationship

I2Transformer: Intra- and Inter-relation Embedding Transformer for TV Show Captioning

Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning

End-to-End Transformer Based Model for Image Captioning

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Image Captioning In the Transformer Age

Show, Deconfound and Tell: Image Captioning with Causal Inference

SPT: Spatial Pyramid Transformer for Image Captioning

Controllable image caption with an encoder-decoder optimization structure