Abstract: Image captioning is an active research topic in computer vision. It is a cross-media data analysis task that combines computer vision and natural language processing: the model must understand the content of an image and generate a caption that is both semantically and grammatically correct. Most existing image captioning methods use an encoder-decoder model. When extracting visual object features from an image, these methods mostly ignore the relative positions of the visual objects, even though the relative position relationships between objects are very important for generating accurate captions. To address this, this paper proposes a transformer for image captioning enhanced with spatial encoding and multi-layer joint encoding. To make better use of the position information contained in the image, it proposes a spatial encoding mechanism for visual objects, which converts the independent spatial information of each visual object into relative spatial relationships between visual objects, helping the model recognize how the objects are positioned relative to one another. In addition, in the visual object encoder, the top-layer encoding features retain more semantic information that fits the image but lose part of the image's visual information. To account for this, the paper proposes a multi-layer joint encoding mechanism that enriches the semantic information in the top encoding layer by integrating the image feature information contained in each shallower encoding layer, yielding richer semantic features that fit the image. The proposed method is evaluated on the MSCOCO dataset with multiple metrics (BLEU, METEOR, ROUGE-L, CIDEr, etc.). Ablation experiments show that the proposed spatial encoding and multi-layer joint encoding mechanisms both help generate more accurate and effective image captions. Comparative experiments show that the proposed method produces accurate and effective image captions and outperforms most recent methods.
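
The abstract does not give the exact form of the spatial encoding, so the following is only a minimal sketch of the general idea of turning per-object box coordinates into pairwise relative features. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) with positive width and height, and it uses a log-ratio geometry formulation that is common in prior relation-aware captioning and detection work, not necessarily the one used in this paper. The function name `relative_box_geometry` and the `eps` clamp are hypothetical choices made for illustration.

```python
import math


def relative_box_geometry(boxes, eps=1e-3):
    """Return pairwise relative geometry features for a list of boxes.

    boxes: sequence of (x_min, y_min, x_max, y_max) with positive width/height
           (an assumed input format, not taken from the paper).
    Result: feats[i][j] is a 4-tuple describing box j relative to box i:
            (log |dx| / w_i, log |dy| / h_i, log w_j / w_i, log h_j / h_i).
    """
    feats = []
    for (x0i, y0i, x1i, y1i) in boxes:
        wi, hi = x1i - x0i, y1i - y0i
        cxi, cyi = x0i + wi / 2, y0i + hi / 2
        row = []
        for (x0j, y0j, x1j, y1j) in boxes:
            wj, hj = x1j - x0j, y1j - y0j
            cxj, cyj = x0j + wj / 2, y0j + hj / 2
            row.append((
                # Center offsets, normalized by the reference box size.
                # Clamped at eps so that log(0) never occurs.
                math.log(max(abs(cxi - cxj) / wi, eps)),
                math.log(max(abs(cyi - cyj) / hi, eps)),
                # Scale ratios between the two boxes.
                math.log(wj / wi),
                math.log(hj / hi),
            ))
        feats.append(row)
    return feats


# Two hypothetical detections: a small box and a wider box to its right.
boxes = [(0, 0, 2, 2), (2, 0, 6, 2)]
rel = relative_box_geometry(boxes)
```

In a transformer encoder, features of this kind are typically projected and added as a bias to the attention weights between object pairs; whether the proposed model does so is not stated in the abstract.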

Image Captioning with a Constraint of Image-to-Text Transformation

A Denoising Framework for Image Caption.

Image Caption Method from Coarse to Fine Based on Dual Encoder-Decoder Framework

Spatial Encoding and Multi-layer Joint Encoding Enhanced Transformer for Image Captioning

Show, Deconfound and Tell: Image Captioning with Causal Inference

Enhanced Modality Transition for Image Captioning

Improving Image Captioning with Better Use of Captions

BENet: bi-directional enhanced network for image captioning

CLIP4Caption: CLIP for Video Caption

Controllable image caption with an encoder-decoder optimization structure

Image Captioning with Multi-Context Synthetic Data

Caption Feature Space Regularization for Audio Captioning

Unpaired Image Captioning With semantic-Constrained Self-Learning

Based-CLIP early fusion transformer for image caption

Tag‐inferring and tag‐guided Transformer for image captioning

Learning to Guide Decoding for Image Captioning

Object Modifier Generation for Image Captioning

Controllable Image Caption Based on Adaptive Weight and Optimization Strategy

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training