Abstract: We propose the Scene Graph Auto-Encoder (SGAE), which incorporates the language inductive bias into the encoder-decoder image captioning framework to produce more human-like captions. Intuitively, we humans use this inductive bias to compose collocations and make contextual inferences in discourse. For example, when we see the relation "a person on a bike", it is natural to replace "on" with "ride" and infer "a person riding a bike on a road" even when the "road" is not evident. Exploiting such bias as a language prior is therefore expected to help conventional encoder-decoder models reason as humans do and generate more descriptive captions. Specifically, we use the scene graph, a directed graph (G) in which an object node is connected to adjective nodes and relationship nodes, to represent the complex structural layout of both the image (I) and the sentence (S). In the language domain, we use SGAE to learn a dictionary set (D) that helps reconstruct sentences in the S → G_S → D → S auto-encoding pipeline, where D encodes the desired language prior and the decoder learns to caption from that prior. In the vision-language domain, we share D in the I → G_I → D → S pipeline and distill the knowledge of the auto-encoder's language decoder into the decoder of the encoder-decoder image captioner, thereby transferring the language inductive bias. In this way, the shared D provides the encoder-decoder with hidden embeddings of descriptive collocations, and the distillation strategy teaches the encoder-decoder to transform these embeddings into human-like captions, as the auto-encoder does. Thanks to the scene graph representation, the shared dictionary set, and the knowledge distillation strategy, the inductive bias is transferred across domains in principle.
We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, where our SGAE-based single model achieves a new state-of-the-art 129.6 CIDEr-D on the Karpathy split and a competitive 126.6 CIDEr-D (c40) on the official server, which is comparable even to ensemble models. Furthermore, we validate the transferability of SGAE in two more challenging settings: transferring the inductive bias from other language corpora, and unpaired image captioning. The results in both settings again confirm the superiority of SGAE. The code is released at https://github.com/yangxuntu/SGAE.
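
To make the two central objects of the abstract concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes that a scene graph can be stored as objects, adjectives, and (subject, predicate, object) triples, and that "re-encoding through D" can be read as a soft attention lookup over the dictionary rows; the abstract does not specify the mechanism, so this reading is an assumption. The names `SceneGraph`, `softmax`, and `re_encode`, the adjective "red", and the toy dictionary values are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Directed graph G: objects linked to adjective and relationship nodes."""
    objects: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)  # object -> list of adjectives
    relations: list = field(default_factory=list)   # (subject, predicate, object)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def re_encode(x, dictionary):
    """Re-express embedding x as an attention-weighted mix of dictionary rows.

    Assumption: similarity is a plain dot product between x and each row of D.
    """
    scores = [sum(xi * di for xi, di in zip(x, row)) for row in dictionary]
    weights = softmax(scores)
    return [
        sum(w * row[j] for w, row in zip(weights, dictionary))
        for j in range(len(x))
    ]


# The abstract's example relation; "red" is a hypothetical adjective node.
graph = SceneGraph(
    objects=["person", "bike"],
    attributes={"bike": ["red"]},
    relations=[("person", "on", "bike")],
)

# Toy dictionary D with three 2-dimensional entries (made-up values).
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [2.0, 0.0]          # hypothetical node embedding from G_S or G_I
x_hat = re_encode(x, D)  # lies in the convex hull of the rows of D
```

Because the same `D` would be used in both the S → G_S → D → S and I → G_I → D → S pipelines, every re-encoded embedding is a mixture of entries learned from sentences, which is one way to picture how the language prior could reach the image captioner.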

Topic Scene Graphs for Image Captioning

Topic Scene Graph Generation by Attention Distillation from Caption

Scene Graph Generation for Better Image Captioning?

Transforming Visual Scene Graphs to Image Captions

In Defense of Scene Graphs for Image Captioning

Are scene graphs good enough to improve Image Captioning?

Image Captioning with Scene-graph Based Semantic Concepts

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

Image Caption Generation Method Based on Knowledge Graph Guidance and Self-Attention Mechanism

Image Captioning with Novel Topics Guidance and Retrieval-based Topics Re-weighting

Image Captioning Based on Semantic Scenes

Tag‐inferring and tag‐guided Transformer for image captioning

Auto-Encoding Scene Graphs for Image Captioning

Image paragraph captioning with topic clustering and topic shift prediction

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Intention Oriented Image Captions With Guiding Objects

Auto-Encoding and Distilling Scene Graphs for Image Captioning

TPsgtR: Neural-Symbolic Tensor Product Scene-Graph-Triplet Representation for Image Captioning

Text Pared into Scene Graph for Diverse Image Generation