Abstract: Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting their ability to recognize only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on the node and edge: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture, which learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pretraining utilizing image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework. Our code is available at <a class="link-external link-https" href="https://github.com/gpt4vision/OvSGTR/" rel="external noopener nofollow">this https URL</a>.
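
The four settings named in the abstract differ only in whether the evaluation vocabulary contains object (node) categories or relation (edge) categories that were unseen during training. The following is a minimal sketch of that taxonomy; the function name `sgg_setting` and its two boolean parameters are hypothetical and are not taken from the OvSGTR code.

```python
def sgg_setting(novel_objects: bool, novel_relations: bool) -> str:
    """Map the presence of unseen node/edge categories to a setting name.

    Hypothetical helper for illustration only; the setting names come
    from the abstract, the function itself does not.
    """
    if novel_objects and novel_relations:
        return "OvD+R-SGG"       # unseen objects and unseen relations
    if novel_objects:
        return "OvD-SGG"         # unseen objects, predefined relations
    if novel_relations:
        return "OvR-SGG"         # predefined objects, unseen relations
    return "Closed-set SGG"      # everything is predefined


if __name__ == "__main__":
    for objs in (False, True):
        for rels in (False, True):
            print(objs, rels, sgg_setting(objs, rels))
```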

What problem does this paper attempt to address?

This paper addresses the open-vocabulary problem in Scene Graph Generation (SGG), in particular open-vocabulary recognition of both objects and relations. Traditional SGG methods are limited by the closed-set assumption: they can only recognize predefined object and relation categories, which restricts their use in practical applications. To address this limitation, the authors propose a unified framework, OvSGTR (Open-vocabulary Scene Graph Transformers), that aims at fully open-vocabulary scene graph generation. Specifically, the paper addresses the following key issues:

1. **Classification of Closed-set and Open-vocabulary Settings**: The authors divide SGG scenarios into four settings: closed-set SGG, open-vocabulary object detection-based SGG (OvD-SGG), open-vocabulary relation-based SGG (OvR-SGG), and open-vocabulary SGG combining objects and relations (OvD+R-SGG). Each setting poses different challenges, depending on whether unseen objects, unseen relations, or both must be handled.
2. **Challenges in Open-vocabulary Scene Graph Generation**: Closed-set SGG can only recognize predefined categories, while open-vocabulary SGG must recognize objects and relations that were not seen during training. The task is hardest when the model encounters unseen objects and unseen relations at the same time.
3. **Visual-Concept Alignment and Preservation**: OvSGTR aligns image features with text features through a visual-concept alignment strategy, for both nodes and edges, so that it can recognize unseen categories. To counter catastrophic forgetting (the model losing previously learned knowledge when it is trained on new data), the authors introduce a knowledge distillation strategy so that the model retains its understanding of existing relations while learning new ones.
4. **Experimental Verification**: The paper reports extensive experiments on the Visual Genome benchmark. According to the results, OvSGTR is effective in all four settings and outperforms existing methods, particularly in the open-vocabulary scenarios.

In conclusion, the main goal of this paper is to overcome the closed-set assumption of traditional SGG methods with the OvSGTR framework, broadening the practical applicability of SGG and improving generalization to unseen objects and relations.
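
The visual-concept alignment and distillation ideas summarized above can be illustrated with a toy example. This is only a sketch under stated assumptions, not the OvSGTR implementation: the use of cosine similarity for alignment, the mean absolute difference as the distillation penalty, the three-dimensional embeddings, and all function names (`cosine`, `align`, `distillation_loss`) are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align(visual_feature, concept_embeddings):
    """Pick the concept whose text embedding best matches a visual feature.

    Because the candidate set is just a dict of name -> embedding, new
    (unseen) category names can be added at inference time.
    """
    scores = {name: cosine(visual_feature, emb)
              for name, emb in concept_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


def distillation_loss(student_features, teacher_features):
    """Mean absolute difference between student and teacher features.

    Assumed form of the penalty: keeping the student close to a frozen
    teacher is one common way to limit forgetting.
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_features, teacher_features):
        for s, t in zip(s_vec, t_vec):
            total += abs(s - t)
            count += 1
    return total / count


# Toy relation (edge) vocabulary with made-up text embeddings.
concepts = {
    "riding": [1.0, 0.0, 0.0],
    "holding": [0.0, 1.0, 0.0],
    "feeding": [0.0, 0.0, 1.0],
}
edge_feature = [0.1, 0.2, 0.9]  # made-up visual feature for one edge
label, scores = align(edge_feature, concepts)
print(label)
```

In this toy setup the edge feature lies closest to the "feeding" embedding, so that label is selected; the distillation term is zero whenever the student features equal the teacher features.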

Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

Learning to Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space

Towards Open-Vocabulary Scene Graph Generation with Prompt-Based Finetuning.

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

Open-Vocabulary Object Detection via Scene Graph Discovery

Fast Contextual Scene Graph Generation with Unbiased Context Augmentation.

Scene Graph Generation with Role-Playing Large Language Models

Reasoning in Different Directions: Triplet Learning for Scene Graph Generation

Fine‐Grained Scene Graph Generation with Overlap Region and Geometrical Center

Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection

Open-Vocabulary Octree-Graph for 3D Scene Understanding

Toward a Unified Transformer-Based Framework for Scene Graph Generation and Human-Object Interaction Detection

SGTR+: End-to-end Scene Graph Generation with Transformer

OV-VG: A benchmark for open-vocabulary visual grounding

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

Explore Contextual Information for 3D Scene Graph Generation

Visual Distant Supervision for Scene Graph Generation

SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

Boosting Scene Graph Generation with Contextual Information

Learning to Generate Scene Graph from Head to Tail