Abstract:<p>Visual attention mechanisms are widely used in image captioning models to dynamically attend to relevant visual regions based on the given language information. This capability allows a trained model to carry out fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions in the image and on the alignment between the language representation and those individual regions. They do not fully explore the relationships and interactions between visual regions. Furthermore, they do not analyze or explore the alignment for related words and phrases (e.g., verbs or phrasal verbs), which may best describe the relationships and interactions between these regions. As a result, current image captioning models can produce inaccurate or inappropriate descriptions. Instead of the visual region attention commonly addressed by existing visual attention mechanisms, this paper proposes a novel visual relationship attention via contextualized embeddings of individual regions. It dynamically explores the related visual relationships existing between multiple regions when generating interaction words. This relationship exploration process is constrained by spatial relationships and driven by the linguistic context of the language decoder. In this work, the new visual relationship attention is designed as a parallel attention mechanism under a learned spatial constraint, in order to map visual relationship information more precisely to the semantic description of that relationship in language. Unlike existing methods for exploring visual relationships, it is trained implicitly through an unsupervised approach, without using any explicit visual relationship annotations. By integrating the proposed visual relationship attention with existing visual region attention, our image captioning model can generate high-quality captions. Extensive experiments on the MSCOCO dataset demonstrate that the proposed visual relationship attention effectively boosts captioning performance by capturing related visual relationships for generating accurate interaction descriptions.</p>
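
The idea of attending over region pairs rather than single regions can be illustrated with a minimal sketch. This is not the paper's implementation: the pair embedding (an element-wise average of two region features), the dot-product scoring, and the fixed center-distance threshold standing in for the learned spatial constraint are all assumptions made for illustration, and the names `relationship_attention`, `spatially_related`, and `max_dist` are hypothetical.

```python
import math
from itertools import combinations


def center(box):
    """Return the center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def spatially_related(box_a, box_b, max_dist):
    """Hand-written stand-in for the learned spatial constraint (assumption):
    two regions may interact only if their centers are close enough."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return math.hypot(ax - bx, ay - by) <= max_dist


def relationship_attention(features, boxes, query, max_dist=0.5):
    """Attend over region pairs, driven by a decoder query vector.

    features: one feature vector (list of floats) per region
    boxes:    one (x1, y1, x2, y2) box per region
    query:    decoder state, same dimensionality as the features
    Returns (weights keyed by region pair, attended context vector).
    """
    # Keep only the pairs that pass the spatial constraint.
    pairs = [(i, j) for i, j in combinations(range(len(features)), 2)
             if spatially_related(boxes[i], boxes[j], max_dist)]
    if not pairs:
        return {}, [0.0] * len(query)

    # Illustrative pair embedding: element-wise average of the two regions.
    pair_feats = [[(a + b) / 2.0 for a, b in zip(features[i], features[j])]
                  for i, j in pairs]

    # Dot-product scores against the query, then a numerically stable softmax.
    scores = [sum(p * q for p, q in zip(pf, query)) for pf in pair_feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of pair embeddings gives the relationship context vector.
    context = [sum(w * pf[d] for w, pf in zip(weights, pair_feats))
               for d in range(len(query))]
    return dict(zip(pairs, weights)), context


# Toy example: regions 0 and 1 are close together, region 2 is far away.
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
boxes = [(0.0, 0.0, 0.2, 0.2), (0.1, 0.1, 0.3, 0.3), (0.8, 0.8, 1.0, 1.0)]
query = [1.0, 0.0]
weights, context = relationship_attention(features, boxes, query, max_dist=0.5)
```

With the toy inputs above, only the pair `(0, 1)` passes the distance check, so it receives all of the attention weight. Raising `max_dist` admits the other pairs and the softmax then spreads weight across them according to how well each pair embedding matches the query.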

Generating Spatial-aware Captions for TextCaps

Accurate and Complete Captions for Question-controlled Text-aware Image Captioning

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

Improving OCR-based Image Captioning by Incorporating Geometrical Relationship

Visual contextual relationship augmented transformer for image captioning

LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text

Text-to-image Generation Based on Spatial-Channel Attention and Semantic Redescription

TextCaps: a Dataset for Image Captioning with Reading Comprehension

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

CapsFusion: Rethinking Image-Text Data at Scale

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

Exploring Visual Relationship for Image Captioning

RESTHT: relation-enhanced spatial–temporal hierarchical transformer for video captioning

A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning

Exploring region relationships implicitly: Image captioning with visual relationship attention

Context-Aware Transformer for image captioning

TypeFormer: Multiscale Transformer With Type Controller for Remote Sensing Image Caption

Improving Image Captioning with Better Use of Captions

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset