Abstract: In this work, we present a tag‐inferring and tag‐guided Transformer for image captioning to explore the role of tags. First, a tag‐inferring encoder (TIE) is designed to infer tags with rich semantic information by combining tags provided by the scene graph model with image features extracted by the detection model. Then, a tag‐guided decoder (TGD), comprising short‐term attention (STA) and gated cross‐modal attention, is proposed to decode sentences using the semantic information provided by tags and image features.

Image captioning is an important task for understanding images. Recently, many studies have used tags to build alignments between image information and language information. However, existing methods overlook the fact that simple semantic tags have difficulty expressing the detailed semantics of different image contents. Therefore, the authors propose a tag‐inferring and tag‐guided Transformer for image captioning to generate fine‐grained captions. First, a tag‐inferring encoder is proposed, which uses the tags extracted by the scene graph model to infer tags with deeper semantic information. Then, using the inferred tag information, a tag‐guided decoder is proposed; it includes short‐term attention, which improves the features of words in the sentence, and gated cross‐modal attention, which combines image features, tag features and language features to produce informative semantic features. Finally, the word probability distribution at every position in the sequence is calculated to generate descriptions for the image. The experiments demonstrate that the authors' method can combine tags to obtain precise captions and that it achieves competitive performance, with a 40.6% BLEU‐4 score and a 135.3% CIDEr score on the MSCOCO data set.
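The abstract does not give the equations for gated cross‐modal attention, so the following is only a minimal sketch of one plausible reading: each word feature attends separately to image features and tag features, and a sigmoid gate mixes the two contexts before a softmax over the vocabulary. The gate formulation, the residual sum before the output projection, all dimensions, and every name (`attend`, `gated_cross_modal`, `w_gate`, `w_vocab`) are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, mat):
    # Multiply a length-n vector by an n x m matrix, giving a length-m vector.
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]

def attend(query, memory):
    # Scaled dot-product attention of one query vector over the rows of memory.
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, row)) / scale for row in memory])
    return [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(len(query))]

def gated_cross_modal(word, image_feats, tag_feats, w_gate):
    # Assumed gating: a sigmoid gate computed from [word; image ctx; tag ctx]
    # decides, per dimension, how much image versus tag context to keep.
    img_ctx = attend(word, image_feats)
    tag_ctx = attend(word, tag_feats)
    gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(word + img_ctx + tag_ctx, w_gate)]
    return [g * i + (1.0 - g) * t for g, i, t in zip(gate, img_ctx, tag_ctx)]

def rand_matrix(rows, cols, rng, scale=1.0):
    # Random stand-in for learned parameters or extracted features.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, vocab_size = 8, 20                      # hypothetical sizes
words = rand_matrix(5, d_model, rng)             # language features, 5 positions
image_feats = rand_matrix(36, d_model, rng)      # detected region features
tag_feats = rand_matrix(10, d_model, rng)        # inferred tag features
w_gate = rand_matrix(3 * d_model, d_model, rng, 0.1)
w_vocab = rand_matrix(d_model, vocab_size, rng, 0.1)

fused = [gated_cross_modal(w, image_feats, tag_feats, w_gate) for w in words]
# Word probability distribution at every position (residual sum is an assumption).
word_probs = [softmax(matvec([a + b for a, b in zip(w, f)], w_vocab))
              for w, f in zip(words, fused)]
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the image context and the tag context, which is one simple way to let tags guide decoding without discarding visual evidence.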

TransCP: A Transformer Pointer Network for Generic Entity Description Generation with Explicit Content-Planning

Sentence Generation for Entity Description with Content-Plan Attention.

GCP: Graph Encoder with Content-Planning for Sentence Generation from Knowledge Base

GCP: Graph Encoder with Content-Planning for Sentence Generation from Knowledge Bases.

Grouped-Attention for Content-Selection and Content-Plan Generation.

Neural data-to-text generation with dynamic content planning

Plan, Attend, Generate: Character-level Neural Machine Translation with Planning in the Decoder

Reach the Remote Neighbors: Dual-Encoding Transformer for Graphs

Paraphrase Generation Model Integrating Transformer Architecture, Part-of-Speech Features, and Pointer Generator Network

R2D2: Relational Text Decoding with Transformers

CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

Generating Coherent Narratives by Learning Dynamic and Discrete Entity States with a Contrastive Framework

Order-Planning Neural Text Generation from Structured Data

A Combined Encoder and Transformer Approach for Coherent and High-Quality Text Generation

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Enhancing Text Representations Separately with Entity Descriptions

Controllable Neural Dialogue Summarization with Personal Named Entity Planning

Infobox-to-text Generation with Tree-like Planning Based Attention Network

Tag‐inferring and tag‐guided Transformer for image captioning

GRET: Global Representation Enhanced Transformer

TENER: Adapting Transformer Encoder for Name Entity Recognition