Abstract: Image captioning is among the most important tasks in computer vision. Most existing methods seek to mine more useful contextual information from image features. With the same goal, this paper proposes a visual contextual relationship augmented transformer (VRAT) for improving the correctness of generated image descriptions. In VRAT, visual contextual features are enhanced by a pre-trained visual contextual relationship augmented module (VRAM). In VRAM, we classify image features into three categories: global, object, and grid, and use the encoders of CLIP and ResNeXt to encode images and text, supplementing the original image descriptions with visual and textual features. A similarity retrieval model is then constructed to match global, object, and grid features for contextual relationships. During training, our model supplements the original image captioning model with global, object, and grid visual features as well as textual features. In addition, to improve the quality of the attended image features, we propose an attention augmented module (AAM) that adds a compensated attention module to the original multi-head attention, which allows the large number of image features in the model to focus more on important information and filters out some unimportant attention information. To alleviate the imbalance between positive and negative samples during training, we propose a multi-label focal loss and combine it with the original cross-entropy loss function to improve the performance of the model. Experiments on the MSCOCO image captioning benchmark show that the proposed method performs well and outperforms many existing state-of-the-art methods. The CIDEr and BLEU-1 scores improve over the baseline model by 7.7 and 1.5, respectively.
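To make the loss combination mentioned in the abstract more concrete, the following is a minimal sketch of how a multi-label focal loss might be added to a cross-entropy term. It is not the authors' implementation: the abstract does not give the formulation, so the code assumes the standard binary focal loss applied independently to each label, and the names `multi_label_focal_loss`, `cross_entropy`, `combined_loss`, and the parameters `gamma`, `alpha`, and `weight` are all hypothetical choices made for illustration.

```python
import math


def multi_label_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-8):
    """Mean binary focal loss over independent labels (assumed formulation).

    probs:   predicted probability for each label, each in (0, 1)
    targets: ground-truth 0/1 value for each label
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # p_t is the probability the model assigns to the correct outcome.
        p_t = p if y == 1 else 1.0 - p
        # alpha_t balances positive and negative labels.
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma shrinks the loss of well-classified labels.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
    return total / len(probs)


def cross_entropy(word_probs, target_index, eps=1e-8):
    """Cross-entropy for one predicted word distribution."""
    return -math.log(max(word_probs[target_index], eps))


def combined_loss(word_probs, target_index, label_probs, label_targets, weight=1.0):
    """Cross-entropy plus a weighted focal term; `weight` is an assumption."""
    focal = multi_label_focal_loss(label_probs, label_targets)
    return cross_entropy(word_probs, target_index) + weight * focal
```

With `gamma=0` and `alpha=0.5` the focal term reduces to half of the usual binary cross-entropy, which is a convenient sanity check. Larger `gamma` values reduce the contribution of labels the model already predicts confidently, which is the usual motivation for focal loss under positive/negative imbalance.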

DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps

Generating Spatial-aware Captions for TextCaps

End-to-End 3D Dense Captioning with Vote2Cap-DETR

Accurate and Complete Captions for Question-controlled Text-aware Image Captioning

Enhancing image captioning with depth information using a Transformer-based framework

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

EDTC: enhance depth of text comprehension in automated audio captioning

Visual contextual relationship augmented transformer for image captioning

Context-Aware Transformer for image captioning

Tag‐inferring and tag‐guided Transformer for image captioning

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Dual visual align-cross attention-based image captioning transformer

Transformer with multi-level grid features and depth pooling for image captioning

Introducing Depth into Transformer-based 3D Object Detection

CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning

Style-Enhanced Transformer for Image Captioning in Construction Scenes

Improving OCR-based Image Captioning by Incorporating Geometrical Relationship

Semantic association enhancement transformer with relative position for image captioning

Dual-level Collaborative Transformer for Image Captioning