Abstract: Image captioning generates a natural-language description of a given image. Because of its potential across a wide range of applications, many deep learning-based methods have been proposed. The co-occurrence of words such as "mouse" and "keyboard" constitutes commonsense knowledge, which is referred to as consensus. However, it is challenging to incorporate such commonsense knowledge when producing captions with rich, natural, and meaningful semantics. In this paper, a Vision-enhanced and Consensus-aware Transformer (VCT) is proposed to exploit both visual information and consensus knowledge for image captioning. It has three key components: a vision-enhanced encoder, a consensus-aware knowledge representation generator, and a consensus-aware decoder. The vision-enhanced encoder extends the vanilla self-attention module with a memory-based attention module and a visual perception module to learn a better visual representation of an image. Specifically, the memory-based attention module uses scene memory to capture the relationships between regions in an image and the image’s global context. The visual perception module further enhances the correlation among neighboring tokens in both the spatial and channel-wise dimensions. To learn consensus-aware representations, a word correlation graph is constructed by computing the statistical co-occurrence between semantic concepts. The consensus-aware knowledge representation generator then applies a graph convolutional network to this graph to acquire consensus knowledge. Finally, this consensus knowledge is integrated into the consensus-aware decoder through consensus memory and a knowledge-based control module to produce a caption. Experimental results on two popular benchmark datasets (MSCOCO and Flickr30k) demonstrate that the proposed model achieves state-of-the-art performance. Extensive ablation studies also validate the effectiveness of each component.
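
As an illustration of the consensus step described in the abstract, the following is a minimal sketch of how a word correlation graph might be built from caption co-occurrence statistics and passed through a single graph-convolution layer. The abstract does not specify the construction, so everything here is an assumption made for illustration: the function names (`cooccurrence_graph`, `normalize`, `gcn_layer`), the conditional-probability threshold, the symmetric edge rule, the toy captions, and the fixed weights are all hypothetical, and the propagation rule is the commonly used normalized form ReLU(D^{-1/2}(A + I)D^{-1/2} H W), not necessarily the one used in VCT.

```python
import math
from itertools import combinations


def cooccurrence_graph(captions, vocab, threshold=0.6):
    """Return an undirected 0/1 adjacency matrix over `vocab`.

    Assumption: concepts i and j are linked when P(j | i) or P(i | j),
    estimated from caption-level co-occurrence counts, reaches `threshold`.
    """
    index = {word: i for i, word in enumerate(vocab)}
    n = len(vocab)
    count = [0] * n                      # captions containing concept i
    pair = [[0] * n for _ in range(n)]   # captions containing both i and j
    for caption in captions:
        present = sorted({index[w] for w in caption.lower().split() if w in index})
        for i in present:
            count[i] += 1
        for i, j in combinations(present, 2):
            pair[i][j] += 1
            pair[j][i] += 1
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or count[i] == 0 or count[j] == 0:
                continue
            p_j_given_i = pair[i][j] / count[i]
            p_i_given_j = pair[i][j] / count[j]
            if max(p_j_given_i, p_i_given_j) >= threshold:
                adj[i][j] = 1.0
    return adj


def normalize(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj_norm, features, weight):
    """One graph-convolution step: ReLU(adj_norm @ features @ weight)."""
    hidden = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


# Toy data, purely illustrative.
captions = [
    "a mouse next to a keyboard",
    "a keyboard and a mouse on a desk",
    "a dog on a beach",
    "a dog chasing a ball",
]
vocab = ["mouse", "keyboard", "dog", "beach"]

adj = cooccurrence_graph(captions, vocab)
adj_norm = normalize(adj)

# One-hot node features and a fixed 4x2 weight matrix (learned in practice).
features = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
weight = [[0.5, -0.5], [0.5, 0.5], [-0.5, 0.5], [0.5, 0.5]]
consensus = gcn_layer(adj_norm, features, weight)
```

In this toy example "mouse" and "keyboard" end up connected and "mouse" and "dog" do not, so after one propagation step the representation of each concept mixes in the features of the concepts it tends to co-occur with. In a trained model the weight matrix and node features would be learned, and the resulting concept representations would be what the decoder consults as consensus memory.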

Vision-Enhanced and Consensus-Aware Transformer for Image Captioning