Abstract: Image captioning (IC), which bridges vision and language, has drawn extensive attention. A crucial aspect of IC is the accurate depiction of visual relations among image objects. Visual relations encompass two primary facets: content relations and structural relations. Content relations, which comprise geometric position content (i.e., distances and sizes) and semantic interaction content (i.e., actions and possessives), reveal the mutual correlations between objects. In contrast, structural relations pertain to the topological connectivity of object regions. Existing Transformer-based methods typically resort to geometric positions to enhance visual relations, yet shallow geometric content alone cannot precisely capture action-level content correlations or structural connection relations. In this article, we adopt a comprehensive perspective on the correlations between objects, incorporating both content relations (i.e., geometric and semantic relations) and structural relations, with the aim of generating plausible captions. To achieve this, first, we construct a geometric graph from bounding box features and a semantic graph from a scene graph parser to model the content relations. We further construct a topology graph that combines the sparsity characteristics of the geometric and semantic graphs, enabling the representation of image structural relations. Second, we propose a novel unified approach that enriches image relation representations by integrating semantic, geometric, and structural relations into self-attention. Finally, in the language decoding stage, we further leverage the semantic relations as prior knowledge to generate accurate words. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our model, improving CIDEr from 128.6% to 136.6%. Code has been released at https://github.com/CrossmodalGroup/ER-SAN/tree/main/VG-Cap .
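To make the idea of injecting relations into self-attention concrete, the following is a minimal NumPy sketch. It is not the released ER-SAN code. Several things in it are assumptions made purely for illustration: the function names `geometric_relations` and `relation_biased_attention`, the log-ratio encoding of box distances and sizes, the use of a single scalar bias per object pair, and the construction of the topology mask as the union of geometric and semantic edges. The abstract does not specify any of these details.

```python
import numpy as np

def geometric_relations(boxes):
    """Pairwise geometric features for boxes given as [x1, y1, x2, y2].

    Returns an (n, n, 4) array: log-scaled center offsets (dx, dy)
    and log size ratios (dw, dh). This encoding is an assumption.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    w = np.maximum(boxes[:, 2] - boxes[:, 0], 1e-3)
    h = np.maximum(boxes[:, 3] - boxes[:, 1], 1e-3)
    dx = np.log(np.maximum(np.abs(cx[:, None] - cx[None, :]), 1e-3) / w[:, None])
    dy = np.log(np.maximum(np.abs(cy[:, None] - cy[None, :]), 1e-3) / h[:, None])
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def relation_biased_attention(x, rel_bias, adjacency):
    """Single-head self-attention with an additive relation bias.

    x:         (n, d) object features (queries, keys, values share x here).
    rel_bias:  (n, n) scalar bias per object pair (content relations).
    adjacency: (n, n) boolean mask (structural relations); pairs that are
               not connected receive zero attention weight.
    """
    n, d = x.shape
    logits = x @ x.T / np.sqrt(d) + rel_bias
    logits = np.where(adjacency, logits, -1e9)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Toy usage with three objects; all values below are made up.
rng = np.random.default_rng(0)
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 25], [30, 30, 40, 50]], dtype=float)
x = rng.normal(size=(3, 8))

geo = geometric_relations(boxes)
w_geo = rng.normal(size=4)   # stand-in for a learned projection
rel_bias = geo @ w_geo       # (3, 3) scalar bias per pair

geo_adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)  # hypothetical
sem_adj = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], dtype=bool)  # hypothetical
topo = geo_adj | sem_adj | np.eye(3, dtype=bool)  # union plus self-loops (assumption)

out, weights = relation_biased_attention(x, rel_bias, topo)
```

In this sketch the content relations shift the attention logits, while the topology mask decides which object pairs may attend to each other at all. Self-loops are added so that every row of the mask has at least one permitted entry.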

Adaptive Semantic-Enhanced Transformer for Image Captioning

Vision-Enhanced and Consensus-Aware Transformer for Image Captioning

Semantic association enhancement transformer with relative position for image captioning

Tag‐inferring and tag‐guided Transformer for image captioning

Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning

Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning

End-to-End Transformer Based Model for Image Captioning

Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning

Improving Image Captioning through Visual and Semantic Mutual Promotion

Adaptive Syncretic Attention for Constrained Image Captioning

Adaptive semantic guidance network for video captioning

An Image Captioning Algorithm Based on Combination Attention Mechanism

Boosted Transformer for Image Captioning

Context-Aware Transformer for image captioning

Entangled Transformer for Image Captioning

Layer-wise enhanced transformer with multi-modal fusion for image caption

Caption TLSTMs: Combining Transformer with LSTMs for Image Captioning

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

Exploring Visual Relationships Via Transformer-based Graphs for Enhanced Image Captioning

Knowing What It Is: Semantic-Enhanced Dual Attention Transformer