Abstract: Image captioning is among the most important tasks in computer vision. Most existing methods try to mine more useful contextual information from image features. With the same goal, this paper proposes a visual contextual relationship augmented transformer (VRAT) to improve the correctness of generated image descriptions. In VRAT, visual contextual features are enhanced by a pre-trained visual contextual relationship augmented module (VRAM). In VRAM, we divide image features into three categories (global, object, and grid) and use the CLIP and ResNeXt encoders to encode images and text, supplementing the original image descriptions with visual and textual features. A similarity retrieval model is then constructed to match global, object, and grid features according to their contextual relationships. During training, these global, object, and grid visual features, together with the textual features, supplement the original image captioning model. In addition, to improve the quality of the attended image features, we propose an attention augmented module (AAM) that adds a compensated attention module to the original multi-head attention, so that the large number of image features in the model focus on important information and unimportant attention information is filtered out. To alleviate the imbalance between positive and negative samples during training, we propose a multi-label focal loss and combine it with the original cross-entropy loss to improve model performance. Experiments on the MSCOCO image captioning benchmark show that the proposed method performs well and outperforms many existing state-of-the-art methods. Relative to the baseline model, the CIDEr and BLEU-1 scores improve by 7.7 and 1.5, respectively.
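
The abstract does not give the exact form of the multi-label focal loss or how it is weighted against cross-entropy. As a rough illustration only, the sketch below applies the standard binary focal loss independently to each label and adds it to a token-level cross-entropy term. The function names, the values of `gamma` and `alpha`, and the mixing weight `lam` are all assumptions made for this example, not details taken from the paper.

```python
import math

EPS = 1e-12  # guards against log(0)


def multilabel_focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean binary focal loss over independent labels (assumed form).

    probs   -- predicted probability for each label, in (0, 1)
    targets -- 0/1 ground-truth value for each label
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, EPS), 1.0 - EPS)
        p_t = p if y == 1 else 1.0 - p              # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        # (1 - p_t)^gamma down-weights labels that are already easy.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def cross_entropy(word_probs, target_index):
    """Cross-entropy for one predicted word distribution."""
    return -math.log(max(word_probs[target_index], EPS))


def combined_loss(word_probs, target_index, label_probs, label_targets, lam=1.0):
    """Cross-entropy plus a weighted focal term; lam is a hypothetical weight."""
    focal = multilabel_focal_loss(label_probs, label_targets)
    return cross_entropy(word_probs, target_index) + lam * focal
```

With `gamma=0` and `alpha=0.5` the focal term reduces to half the ordinary binary cross-entropy, which makes the role of the two extra parameters easy to see: `gamma` controls how strongly well-classified labels are down-weighted, and `alpha` rebalances positive against negative labels.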

Variational Transformer: A Framework Beyond the Trade-off Between Accuracy and Diversity for Image Captioning

User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

Diverse Image Captioning via Conditional Variational Autoencoder and Dual Contrastive Learning

Partial Off-policy Learning: Balance Accuracy and Diversity for Human-Oriented Image Captioning

Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning

VOLTA: Improving Generative Diversity by Variational Mutual Information Maximizing Autoencoder

Visual contextual relationship augmented transformer for image captioning

Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning

Show, tell and rectify: Boost image caption generation via an output rectifier

Diverse and Controllable Image Captioning with Part-of-Speech Guidance

Diverse Image Captioning Via GroupTalk

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

Fast, Diverse and Accurate Image Captioning Guided by Part-of-Speech

ViTOC: Vision Transformer and Object-aware Captioner

Transformer with token attention and attribute prediction for image captioning

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Multi-scale features with temporal information guidance for video captioning