Abstract: Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences. By doing so, we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.

What problem does this paper attempt to address?

This paper addresses scene graph generation (SGG) in the open-vocabulary setting. Existing SGG methods struggle to generate scene graphs that contain novel visual relation concepts: when unseen relations or entities appear in a scene, their performance drops. To overcome this, the authors propose a framework based on sequence generation that uses pre-trained vision-language models (VLMs) to generate open-vocabulary scene graphs. The method generates scene graphs containing both known and novel visual relation triples, and it also improves downstream vision-language tasks through explicit relation modeling.

### Main contributions of the paper:

1. **Propose a new framework**: The framework tackles SGG in a more general open-vocabulary setting, generating scene graphs with known and novel visual relation triples directly from image pixels.
2. **Introduce scene-graph prompts and relation-aware transformation modules**: These components enable the model to learn and generate scene graphs more efficiently.
3. **Achieve strong performance on multiple benchmark datasets**: The framework performs well on generalized open-vocabulary SGG benchmarks and also yields significant improvements on downstream vision-language tasks.

### Specific problems solved:

- **Generate scene graphs containing new visual relations**: Existing SGG methods usually handle only a limited set of visual relation categories and cope poorly with the diversity and complexity of the real world. By using pre-trained VLMs, the method in this paper can generate scene graphs that contain novel visual relations.
- **Enhance the performance of downstream vision-language tasks**: By generating high-quality scene graphs, the method provides richer structured information for downstream tasks (such as visual question answering and image captioning), thereby improving their performance.

### Technical details:

- **Scene-graph sequence generation**: Specially designed scene-graph prompts turn scene graph generation into an image-to-text generation task, and a pre-trained VLM generates the scene-graph sequences.
- **Relation-triple construction**: The position and category information of entities is extracted from the generated scene-graph sequences to construct the final scene graph.
- **Adaptation to downstream vision-language tasks**: By fine-tuning the VLM, the knowledge learned from scene graph generation is transferred to other vision-language tasks to improve their performance.

### Experimental results:

- **Strong results on multiple SGG benchmark datasets**: On the Visual Genome, Panoptic Scene Graph Generation, and OpenImages V6 datasets, the method significantly outperforms existing methods in the open-vocabulary setting.
- **Consistent improvements on downstream vision-language tasks**: With the transferred scene graph generation knowledge, the method also performs well on visual question answering, image captioning, and other tasks.

In conclusion, the paper proposes a framework for generating high-quality scene graphs in the open-vocabulary setting and demonstrates its effectiveness on multiple tasks.
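
As an illustration of the relation-triple construction step described under Technical details, the following is a minimal Python sketch of how a generated scene-graph sequence could be parsed into triples and then split into known and novel relations. The sequence format (`subject [x1,y1,x2,y2] | predicate | object [x1,y1,x2,y2] ; ...`), the class names, and the helper functions are all assumptions made for this example; the summary does not specify the paper's actual prompt or output format.

```python
import re
from dataclasses import dataclass

# Hypothetical sequence format (an assumption, not the paper's actual format):
#   "person [10,20,50,80] | riding | horse [5,40,90,120] ; ..."
ENTITY_RE = re.compile(r"^\s*(?P<label>[^\[\]]+?)\s*\[(?P<box>[^\]]*)\]\s*$")


@dataclass(frozen=True)
class Entity:
    label: str
    box: tuple  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Triple:
    subject: Entity
    predicate: str
    obj: Entity


def parse_entity(text):
    """Parse 'label [x1,y1,x2,y2]' into an Entity, or return None if malformed."""
    match = ENTITY_RE.match(text)
    if match is None:
        return None
    try:
        coords = tuple(float(v) for v in match.group("box").split(","))
    except ValueError:
        return None
    if len(coords) != 4:
        return None
    return Entity(match.group("label"), coords)


def parse_scene_graph(sequence):
    """Turn a generated text sequence into a list of Triples.

    Generated text can be malformed, so incomplete or unparsable
    chunks are skipped rather than raising an error.
    """
    triples = []
    for chunk in sequence.split(";"):
        parts = [p.strip() for p in chunk.split("|")]
        if len(parts) != 3 or not parts[1]:
            continue
        subj, obj = parse_entity(parts[0]), parse_entity(parts[2])
        if subj is None or obj is None:
            continue
        triples.append(Triple(subj, parts[1], obj))
    return triples


def split_by_vocabulary(triples, base_predicates):
    """Separate triples whose predicate is in the base set from novel ones."""
    base = [t for t in triples if t.predicate in base_predicates]
    novel = [t for t in triples if t.predicate not in base_predicates]
    return base, novel


if __name__ == "__main__":
    generated = (
        "person [10,20,50,80] | riding | horse [5,40,90,120] ; "
        "dog [1,2,3,4] | chasing ; "
        "cat [0,0,5,5] | perched on | sofa [0,3,20,10]"
    )
    graph = parse_scene_graph(generated)
    base, novel = split_by_vocabulary(graph, {"riding"})
    for t in graph:
        print(t.subject.label, "-", t.predicate, "->", t.obj.label)
```

Because the predicate is kept as free text instead of being mapped to a fixed class index, a relation outside the base set (here `perched on`) survives parsing unchanged, which is the property an open-vocabulary setting relies on.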

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
