Abstract: Image captioning is among the most important tasks in computer vision. Most existing methods improve captions by mining more useful contextual information from image features. Following the same idea, this paper proposes a visual contextual relationship augmented transformer (VRAT) to improve the correctness of generated image descriptions. In VRAT, visual contextual features are enhanced by a pre-trained visual contextual relationship augmented module (VRAM). In VRAM, we classify images into three categories: global, object, and grid, and use the CLIP and ResNeXt encoders to encode images and text, supplementing the original image descriptions with visual and textual features. A similarity retrieval model is then constructed to match global, object, and grid features for contextual relationships. During training, these global, object, and grid visual features, together with the textual features, supplement the original image captioning model. In addition, to improve the quality of the attended image features, we propose an attention augmented module (AAM) that adds a compensated attention module to the original multi-head attention, which allows the model to focus on the important information among its large number of image features and to filter out unimportant attention information. To alleviate the imbalance between positive and negative samples during training, we propose a multi-label focal loss and combine it with the original cross-entropy loss to improve model performance. Experiments on the MSCOCO image captioning benchmark show that the proposed method performs well and outperforms many existing state-of-the-art methods. Compared with the baseline model, the CIDEr and BLEU-1 scores improve by 7.7 and 1.5, respectively.
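The abstract does not give the exact form of the multi-label focal loss or how it is weighted against the cross-entropy loss. The following is only a minimal sketch of the general idea, assuming the standard binary focal loss applied independently to each label and a simple weighted sum with a token-level cross-entropy term. The function names, the `weight` parameter, and the default values `gamma=2.0` and `alpha=0.25` are illustrative assumptions, not the paper's settings.

```python
import math

EPS = 1e-12  # guards against log(0)

def binary_focal_term(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one label; p is the predicted probability, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t)^gamma down-weights labels that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, EPS))

def multi_label_focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean focal loss over independent binary labels (assumed formulation)."""
    terms = [binary_focal_term(p, y, gamma, alpha) for p, y in zip(probs, targets)]
    return sum(terms) / len(terms)

def cross_entropy(word_probs, target_index):
    """Cross-entropy for a single predicted word distribution."""
    return -math.log(max(word_probs[target_index], EPS))

def combined_loss(word_probs, target_index, label_probs, label_targets, weight=1.0):
    """Hypothetical combination: cross-entropy plus a weighted focal term."""
    focal = multi_label_focal_loss(label_probs, label_targets)
    return cross_entropy(word_probs, target_index) + weight * focal
```

With these definitions, a confidently correct label (for example `p = 0.9`, `y = 1`) contributes far less to the loss than a badly misclassified one (`p = 0.1`, `y = 1`), which is how a focal term can reduce the influence of the many easy negative samples.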

RVAIC: Refined visual attention for improved image captioning

GVA: guided visual attention approach for automatic image caption generation

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Attention on Attention for Image Captioning

Visual contextual relationship augmented transformer for image captioning

RNIC-A retrospect network for image captioning

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

Local-global Visual Interaction Attention for Image Captioning

Bi-Directional Co-Attention Network for Image Captioning

An Image Captioning Approach Using Dynamical Attention

Learning visual relationship and context-aware attention for image captioning

An Improved Attention and Hybrid Optimization Technique for Visual Question Answering

Image Captioning with a Joint Attention Mechanism by Visual Concept Samples

Adaptively Aligned Image Captioning via Adaptive Attention Time

Adaptively Attending to Visual Attributes and Linguistic Knowledge for Captioning

Dynamic-balanced Double-Attention Fusion for Image Captioning

SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning

VSCA: A Sentence Matching Model Incorporating Visual Perception

Image captioning with weakly-supervised attention penalty