Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen
2024-10-07
Abstract: Scene Graph Generation (SGG) offers a structured representation critical in many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting their ability to recognize only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on the node and edge: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR towards fully open vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture, which learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pretraining utilizing image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework. Our code is available at https://github.com/gpt4vision/OvSGTR/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the open-vocabulary problem in Scene Graph Generation (SGG), in particular open-vocabulary recognition of both objects and relations. Traditional SGG methods rely on a closed-set assumption: they can recognize only predefined object and relation categories, which limits their use in practical applications. To remove this limitation, the authors propose a unified framework, OvSGTR (Open-vocabulary Scene Graph Transformers), that targets fully open-vocabulary scene graph generation. Specifically, the paper addresses the following key issues:

1. **Classification of Closed-set and Open-vocabulary Settings**: The authors divide SGG scenarios into four settings: closed-set SGG, open-vocabulary detection-based SGG with unseen objects (OvD-SGG), open-vocabulary relation-based SGG with unseen relations (OvR-SGG), and open-vocabulary SGG combining unseen objects and unseen relations (OvD+R-SGG). Each setting poses different challenges, depending on whether the unseen categories are objects, relations, or both.
2. **Challenges in Open-vocabulary Scene Graph Generation**: Closed-set SGG recognizes only predefined categories, whereas open-vocabulary SGG must recognize objects and relations that were not seen during training. The task is hardest when the model encounters unseen objects and unseen relations at the same time.
3. **Visual-Concept Alignment and Retention**: OvSGTR aligns image features with text features through a visual-concept alignment strategy, so that it can recognize unseen categories. To counter catastrophic forgetting (the model losing previously learned knowledge while training on new data), the authors introduce a knowledge distillation strategy that preserves the model's understanding of previously learned relations while it learns new ones.
4. **Experimental Verification**: The paper reports extensive experiments on the Visual Genome benchmark. According to the authors, OvSGTR outperforms existing methods in all four settings, with the largest gains in the open-vocabulary scenarios.

In conclusion, the main goal of this paper is to overcome the closed-set assumption of traditional SGG methods. The proposed OvSGTR framework is intended to broaden the range of applications and to generalize better to unseen objects and relations.
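The two mechanisms named in the summary, visual-concept alignment and knowledge distillation, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`align`, `distillation_loss`), the three-dimensional embeddings, and the choice of cosine similarity and mean absolute difference are all assumptions made for illustration, since the summary does not specify the similarity measure or the distillation loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align(visual_feature, concept_embeddings):
    """Score one visual feature against the text embedding of every concept.

    Hypothetical helper: any label with a text embedding can be scored,
    including labels that never appeared as training targets.
    """
    scores = {name: cosine(visual_feature, emb)
              for name, emb in concept_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


def distillation_loss(student_features, teacher_features):
    """Mean absolute difference between student and frozen-teacher features.

    Assumption: an L1-style penalty stands in for whatever loss the paper
    uses. A small value means the student has stayed close to the teacher.
    """
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student_features, teacher_features):
        for s, t in zip(s_vec, t_vec):
            total += abs(s - t)
            count += 1
    return total / count


# Toy text embeddings for relation names (made-up values).
relation_embeddings = {
    "on": [1.0, 0.0, 0.0],
    "holding": [0.0, 1.0, 0.0],
    "riding": [0.0, 0.0, 1.0],  # treated here as an unseen relation
}

# Toy edge feature for one subject-object pair (made-up values).
edge_feature = [0.1, 0.2, 0.9]
label, scores = align(edge_feature, relation_embeddings)
print("predicted relation:", label)

# Edge features from a frozen teacher and a fine-tuned student.
teacher = [[0.1, 0.2, 0.9], [0.8, 0.1, 0.1]]
student = [[0.1, 0.4, 0.7], [0.8, 0.1, 0.1]]
print("distillation loss:", distillation_loss(student, teacher))
```

In this toy setup the edge feature lies closest to the embedding of "riding", so that label is predicted even though it is treated as unseen. The distillation term grows as the student's features drift away from the teacher's, which is the sense in which distillation retains the earlier alignment.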