Abstract: Image captioning (IC), which brings vision to language, has drawn extensive attention. A crucial aspect of IC is the accurate depiction of visual relations among image objects. Visual relations encompass two primary facets: content relations and structural relations. Content relations, which comprise geometric position content (i.e., distances and sizes) and semantic interaction content (i.e., actions and possessives), reveal the mutual correlations between objects. In contrast, structural relations pertain to the topological connectivity of object regions. Existing Transformer-based methods typically resort to geometric positions to enhance visual relations, yet shallow geometric content alone cannot precisely cover action-related content correlations or structural connection relations. In this article, we adopt a comprehensive perspective to examine the correlations between objects, incorporating both content relations (i.e., geometric and semantic relations) and structural relations, with the aim of generating plausible captions. To achieve this, first, we construct a geometric graph from bounding box features and a semantic graph from the scene graph parser to model the content relations. We further construct a topology graph that combines the sparsity characteristics of the geometric and semantic graphs, enabling the representation of image structural relations. Second, we propose a novel unified approach to enrich image relation representations by integrating semantic, geometric, and structural relations into self-attention. Finally, in the language decoding stage, we further leverage the semantic relations as prior knowledge to generate accurate words. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our model, improving CIDEr from 128.6% to 136.6%. Code has been released at https://github.com/CrossmodalGroup/ER-SAN/tree/main/VG-Cap
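
To make the idea of integrating relations into self-attention more concrete, the following is a minimal sketch and not the released implementation. It assumes that pairwise geometry is encoded as log-scaled box offsets, that the topology graph is the union of geometric and semantic edges, and that each relation enters attention as an additive bias on the score matrix. The function names (`pairwise_geometry`, `relation_aware_attention`) and the parameters `w_geo`, `b_sem`, and `b_topo` are hypothetical and chosen only for illustration; the abstract does not specify these details.

```python
import numpy as np

def pairwise_geometry(boxes):
    """Relative geometry between boxes given as (x1, y1, x2, y2).

    Returns an (n, n, 4) array of log-scaled offsets and size ratios.
    This encoding is an assumption made for illustration.
    """
    boxes = np.asarray(boxes, dtype=float)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + w / 2
    cy = boxes[:, 1] + h / 2
    eps = 1e-3  # floor to keep log() finite for coincident centers
    dx = np.log(np.maximum(np.abs(cx[:, None] - cx[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(cy[:, None] - cy[None, :]) / h[:, None], eps))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relation_aware_attention(x, geo, geo_adj, sem_adj, w_geo, b_sem, b_topo):
    """Single-head self-attention with additive relation biases (hypothetical).

    x:        (n, d) region features (identity Q/K/V projections for brevity)
    geo:      (n, n, 4) pairwise geometry from pairwise_geometry
    geo_adj:  (n, n) boolean geometric-graph edges
    sem_adj:  (n, n) boolean semantic-graph edges (e.g. from a scene graph parser)
    w_geo:    (4,) weights mapping geometry to a scalar bias
    b_sem, b_topo: scalar biases for semantic and topology edges
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # content term
    topo_adj = geo_adj | sem_adj | np.eye(n, dtype=bool)  # assumed: union of edges
    bias = geo @ w_geo + b_sem * sem_adj + b_topo * topo_adj
    attn = softmax(scores + bias, axis=-1)
    return attn @ x, attn
```

A call might pass three region boxes, random region features, and small boolean adjacency matrices; each row of the returned attention matrix is a distribution over regions that has been shifted toward geometrically, semantically, or structurally related regions.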

Benefit from AMR: Image Captioning with Explicit Relations and Endogenous Knowledge

Model Semantic Relations with Extended Attributes

Improving Image Captioning with Better Use of Caption

Improving Image Captioning with Better Use of Captions

Explicit Image Caption Reasoning: Generating Accurate and Informative Captions for Complex Scenes with LMM

Exploring Explicit and Implicit Visual Relationships for Image Captioning

From Less to More: Common-Sense Semantic Perception Benefits Image Captioning

Neural Symbolic Representation Learning for Image Captioning

Exploring Visual Relationships Via Transformer-based Graphs for Enhanced Image Captioning

Image Captioning With Relational Knowledge

Context-Driven Image Caption With Global Semantic Relations Of The Named Entities

Say As You Wish: Fine-Grained Control of Image Caption Generation with Abstract Scene Graphs

Image Captioning with Emotional Information Via Multiple Model

A Novel Image Captioning Model with Visual-Semantic Similarities and Visual Representations Re-Weighting

Image Captioning using Facial Expression and Attention

Improved Image Captioning via Semantic Feature Update

Video Captioning with External Knowledge Assistance and Multi-feature Fusion

Chinese image captioning with fusion encoder and visual keyword search

Joint Common Sense and Relation Reasoning for Dense Relational Captioning

Adaptive Syncretic Attention for Constrained Image Captioning

A Survey on Recent Advances in Image Captioning