GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen
2024-06-02
Abstract: Training Scene Graph Generation (SGG) models with natural language captions has become increasingly popular due to the abundant, cost-effective, and open-world generalization supervision signals that natural language offers. However, such unstructured caption data and its processing pose significant challenges in learning accurate and comprehensive scene graphs. The challenges can be summarized as three aspects: 1) traditional scene graph parsers based on linguistic representation often fail to extract meaningful relationship triplets from caption data. 2) grounding unlocalized objects of parsed triplets will meet ambiguity issues in visual-language alignment. 3) caption data typically are sparse and exhibit bias to partial observations of image content. Aiming to address these problems, we propose a divide-and-conquer strategy with a novel framework named GPT4SGG, to obtain more accurate and comprehensive scene graph signals. This framework decomposes a complex scene into a bunch of simple regions, resulting in a set of region-specific narratives. With these region-specific narratives (partial observations) and a holistic narrative (global observation) for an image, a large language model (LLM) performs the relationship reasoning to synthesize an accurate and comprehensive scene graph. Experimental results demonstrate GPT4SGG significantly improves the performance of SGG models trained on image-caption data, in which the ambiguity issue and long-tail bias have been well-handled with more accurate and comprehensive scene graphs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to generate more accurate and comprehensive scene graphs from natural language descriptions. A scene graph is a structured, symbolic representation of the objects in an image and the relationships among them. Traditional scene graph generation methods rely on manually annotated data, while recent research uses image-caption data for language-supervised learning. Caption-based supervision raises three challenges:
1. Traditional scene graph parsers often fail to extract meaningful relationship triplets from caption data.
2. Grounding the unlocalized objects of parsed triplets is ambiguous during visual-language alignment.
3. Caption data is often sparse and biased: it covers only part of the image content and omits visual cues needed for a comprehensive scene graph.

To address these challenges, the paper proposes a framework called GPT4SGG, which adopts a divide-and-conquer strategy. The framework first localizes objects, using either annotations or an object detector, so that visual-language alignment starts from known boxes. It then decomposes a complex scene into a series of simple regions and generates a narrative for each region (local) along with a narrative for the whole image (global), which mitigates the bias of caption data. Finally, a large language model (such as GPT-4) infers the relationships between the localized objects from these local and global narratives.

The paper validates GPT4SGG with two specialized instruction-following datasets and runs experiments with a private LLM (Llama 2) fine-tuned on instruction data generated by GPT-4. The results show that GPT4SGG significantly improves the performance of scene graph generation models trained on image-caption data, especially in handling ambiguity and long-tail distribution issues.

In summary, the main contributions of the paper are:
1. The GPT4SGG framework, which uses a large language model (specifically GPT-4) to synthesize scene graphs.
2. Two dedicated instruction-following datasets for evaluating and enhancing the scene graph generation capabilities of LLMs in complex visual contexts.
3. A private, scene-graph-aware LLM (Llama 2) fine-tuned with instruction data generated by GPT-4.
4. Experimental results showing that GPT4SGG generates more accurate and comprehensive scene graphs.
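
The divide-and-conquer flow described above (localized objects, simple regions, local and global narratives, LLM relationship reasoning) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the names `Box`, `candidate_regions`, `build_prompt`, and `parse_triplets` are hypothetical, pairing objects by box overlap is just one plausible way to form "simple regions", the prompt wording is invented, and the JSON triplet format for the LLM reply is assumed. No LLM is called; the reply is passed in as a string.

```python
import json
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """A localized object: a unique label such as "person.1" plus a box."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def candidate_regions(boxes, min_iou=0.0):
    """Assumed region rule: every overlapping object pair forms one region,
    represented by the two labels and the union of the two boxes."""
    regions = []
    for a, b in combinations(boxes, 2):
        if iou(a, b) > min_iou:
            union_box = (min(a.x1, b.x1), min(a.y1, b.y1),
                         max(a.x2, b.x2), max(a.y2, b.y2))
            regions.append((a.label, b.label, union_box))
    return regions


def build_prompt(global_caption, boxes, region_captions):
    """Combine the holistic narrative, the object list, and the
    region-specific narratives into one text prompt (wording is invented)."""
    lines = ["Global description: " + global_caption, "Objects:"]
    for b in boxes:
        lines.append(f"  {b.label}: [{b.x1}, {b.y1}, {b.x2}, {b.y2}]")
    lines.append("Region descriptions:")
    for (subj, obj), caption in region_captions.items():
        lines.append(f"  ({subj}, {obj}): {caption}")
    lines.append("Return a JSON list of [subject, predicate, object] "
                 "triplets using only the object labels above.")
    return "\n".join(lines)


def parse_triplets(llm_output, boxes):
    """Parse the assumed JSON reply and keep only triplets whose subject
    and object are distinct, already-localized objects."""
    known = {b.label for b in boxes}
    return [(s, p, o) for s, p, o in json.loads(llm_output)
            if s in known and o in known and s != o]
```

Because every triplet is checked against the set of localized labels, a relationship that mentions an object with no box is dropped rather than grounded by guesswork, which is the role object localization plays in the summary above.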