Abstract: Training Scene Graph Generation (SGG) models with natural language captions has become increasingly popular because natural language offers abundant, cost-effective supervision signals that generalize to open-world settings. However, such unstructured caption data and its processing pose significant challenges in learning accurate and comprehensive scene graphs. The challenges fall into three areas: 1) traditional scene graph parsers based on linguistic representation often fail to extract meaningful relationship triplets from caption data; 2) grounding the unlocalized objects of parsed triplets runs into ambiguity in visual-language alignment; 3) caption data are typically sparse and biased toward partial observations of image content. To address these problems, we propose a divide-and-conquer strategy with a novel framework named GPT4SGG that obtains more accurate and comprehensive scene graph signals. The framework decomposes a complex scene into a set of simple regions, yielding a set of region-specific narratives. Given these region-specific narratives (partial observations) and a holistic narrative (global observation) for an image, a large language model (LLM) performs relationship reasoning to synthesize an accurate and comprehensive scene graph. Experimental results demonstrate that GPT4SGG significantly improves the performance of SGG models trained on image-caption data, handling both the ambiguity issue and long-tail bias by producing more accurate and comprehensive scene graphs.

What problem does this paper attempt to address?

This paper addresses how to generate more accurate and comprehensive scene graphs from natural language descriptions. A scene graph is a symbolic representation of the objects in an image and the relationships between them. Traditional methods for scene graph generation rely on manually annotated data, while recent research has used image caption data for language-supervised learning. The challenges include: 1) Traditional parsers often fail to extract meaningful relationship triplets from caption data. 2) Ambiguities arise in visual-language alignment when grounding the unlocalized objects of parsed triplets. 3) Caption data is often sparse and biased, covering only parts of the image content and neglecting visual cues needed for a comprehensive scene graph.

To address these challenges, the paper proposes a new framework called GPT4SGG, which adopts a divide-and-conquer strategy. The framework first localizes objects through annotations or object detectors to ensure accurate visual-language alignment. It then decomposes a complex scene into a series of simple regions and generates local (region-specific) and global narratives, mitigating the bias of caption data. Finally, it uses a large language model (such as GPT-4) to infer relationships between objects from the localized objects and these narratives.

The paper validates the GPT4SGG framework with two specialized instruction-following datasets and conducts experiments with a private LLM (Llama 2) fine-tuned on instruction data generated by GPT-4. The results show that GPT4SGG significantly improves the performance of scene graph generation models trained on image caption data, especially in handling ambiguity and long-tail distribution issues.

In summary, the main contributions of the paper are:
1. Introduction of the GPT4SGG framework, which uses a large language model (in particular GPT-4) for scene graph generation and is presented as pioneering work in this area.
2. Development of two dedicated instruction-following datasets for evaluating and enhancing the scene graph generation capabilities of LLMs in complex visual contexts.
3. Fine-tuning of a private, scene graph-aware LLM (Llama 2) with instruction data generated by GPT-4.
4. Experimental results demonstrating that GPT4SGG can generate more accurate and comprehensive scene graphs.
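
The divide-and-conquer steps described in this summary (localized objects, simple regions, local and global narratives, LLM reasoning) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration rather than the paper's actual method: the summary does not say how regions are chosen, so the sketch proposes one region per pair of overlapping boxes, and the `Obj`, `propose_regions`, and `build_prompt` names, the prompt wording, and the example captions are all hypothetical. No captioning model or LLM is called; the sketch only assembles the text that such a model could receive.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Obj:
    """A localized object: unique id, category label, (x1, y1, x2, y2) box."""
    oid: str
    label: str
    box: tuple


def overlaps(a, b):
    """True if two (x1, y1, x2, y2) boxes share a non-empty area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def union_box(a, b):
    """Smallest box enclosing both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def propose_regions(objects):
    """Assumed heuristic (not from the paper): one region per overlapping pair."""
    return [
        {"objects": (s.oid, o.oid), "box": union_box(s.box, o.box)}
        for s, o in combinations(objects, 2)
        if overlaps(s.box, o.box)
    ]


def build_prompt(objects, global_caption, region_captions):
    """Combine localized objects with global and region narratives as LLM input."""
    lines = ["Objects:"]
    lines += [f"  {o.oid}: {o.label} {list(o.box)}" for o in objects]
    lines.append(f"Global description: {global_caption}")
    lines.append("Region descriptions:")
    for region, caption in region_captions:
        lines.append(f"  {list(region['objects'])}: {caption}")
    lines.append("List relationships as (subject_id, predicate, object_id).")
    return "\n".join(lines)


# Illustrative inputs; in practice boxes come from annotations or a detector
# and captions from a captioning model.
objects = [
    Obj("person.1", "person", (10, 10, 60, 120)),
    Obj("horse.2", "horse", (40, 50, 160, 140)),
    Obj("tree.3", "tree", (200, 0, 260, 150)),
]
regions = propose_regions(objects)
prompt = build_prompt(
    objects,
    "A person rides a horse near a tree.",
    [(regions[0], "A person is sitting on a horse.")],
)
print(prompt)
```

Because every object in the prompt carries a unique id and a box, any triplet the LLM returns refers to an already localized object, which is how the framework sidesteps the grounding ambiguity that caption-parsed triplets suffer from.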

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation.

LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation

Reasoning in Different Directions: Triplet Learning for Scene Graph Generation

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

Learning to Generate Scene Graph from Natural Language Supervision

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

Transforming Visual Scene Graphs to Image Captions

SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Scene Graph Generation with Role-Playing Large Language Models

Video Scene Graph Generation from Single-Frame Weak Supervision.

Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

In Defense of Scene Graphs for Image Captioning

Auto-Encoding Scene Graphs for Image Captioning

Towards Lifelong Scene Graph Generation with Knowledge-ware In-context Prompt Learning

Scene Graph Generation for Better Image Captioning?

Auto-Encoding and Distilling Scene Graphs for Image Captioning

Scene Dynamics: Counterfactual Critic Multi-Agent Training for Scene Graph Generation.

SGTR+: End-to-end Scene Graph Generation with Transformer

What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation