From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, Xuming He
2024-04-24
Abstract: Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences. By doing so, we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses scene graph generation (SGG) in the open-vocabulary setting. Existing SGG methods struggle to generate scene graphs that contain novel visual relation concepts: when relations or entities unseen during training appear in a scene, their performance degrades. To overcome this, the authors propose a framework based on sequence generation that uses pre-trained vision-language models (VLMs) to generate open-vocabulary scene graphs. The method generates scene graphs containing both known and novel visual relation triples, and it also improves downstream vision-language tasks through explicit relation modeling.

### Main contributions of the paper:

1. **Propose a new framework**: The framework addresses SGG in a more general open-vocabulary setting, generating scene graphs with both known and novel visual relation triples directly from image pixels.
2. **Introduce scene-graph prompts and relation-aware transformation modules**: These components enable the model to learn and generate scene graphs more efficiently.
3. **Achieve strong performance on multiple benchmark datasets**: The framework performs well on generalized open-vocabulary SGG benchmarks and also yields improvements on downstream vision-language tasks.

### Specific problems solved:

- **Generate scene graphs containing new visual relationships**: Existing SGG methods usually handle only a limited set of visual relation categories and cope poorly with the diversity and complexity of the real world. By building on pre-trained VLMs, the method in this paper can generate scene graphs that contain novel visual relations.
- **Enhance the performance of downstream vision-language tasks**: The generated scene graphs provide richer structured information for downstream tasks (such as visual question answering, image captioning, etc.), thereby improving performance on these tasks.

### Technical details:

- **Scene-graph sequence generation**: Specially designed scene-graph prompts turn scene graph generation into an image-to-text generation task, and a pre-trained VLM generates the scene-graph sequences.
- **Relation-triple construction**: The position and category information of entities is extracted from the generated scene-graph sequences to construct the final scene graph.
- **Adapt to downstream vision-language tasks**: Fine-tuning the VLM transfers the knowledge learned from scene graph generation to other vision-language tasks, improving their performance.

### Experimental results:

- **Strong results on multiple SGG benchmark datasets**: On the Visual Genome, Panoptic Scene Graph Generation and OpenImages V6 datasets, the method outperforms existing methods in the open-vocabulary setting.
- **Consistent improvements in downstream vision-language tasks**: With the knowledge of scene graph generation transferred to other tasks, the method also performs well on visual question answering, image captioning and other tasks.

In conclusion, this paper proposes a sequence-generation framework for generating scene graphs in the open-vocabulary setting and demonstrates its effectiveness on both SGG benchmarks and downstream vision-language tasks.
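To make the relation-triple construction step under "Technical details" more concrete, here is a minimal Python sketch of how generated text might be turned into a graph. It is an illustration under stated assumptions, not the authors' implementation: the serialization format (`label@x1,y1,x2,y2 | predicate | label@x1,y1,x2,y2`, with triples separated by `;`) and the names `Entity`, `Triple`, `parse_entity`, `parse_scene_graph` and `build_graph` are all hypothetical. The VLM itself is not called; the sketch starts from an already generated string.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed normalized


@dataclass(frozen=True)
class Entity:
    label: str  # free-form category text, so novel classes are allowed
    box: Box


@dataclass(frozen=True)
class Triple:
    subject: Entity
    predicate: str  # free-form relation text, so novel predicates are allowed
    obj: Entity


def parse_entity(text: str) -> Optional[Entity]:
    """Parse 'label@x1,y1,x2,y2' (hypothetical format); None if malformed."""
    label, sep, coords = text.strip().partition("@")
    if not sep or not label.strip():
        return None
    try:
        values = tuple(float(v) for v in coords.split(","))
    except ValueError:
        return None
    if len(values) != 4:
        return None
    return Entity(label.strip(), values)


def parse_scene_graph(sequence: str) -> List[Triple]:
    """Split a generated sequence into triples, skipping malformed chunks."""
    triples = []
    for chunk in sequence.split(";"):
        parts = [p.strip() for p in chunk.split("|")]
        if len(parts) != 3 or not parts[1]:
            continue  # generated text can be noisy; drop incomplete triples
        subj, obj = parse_entity(parts[0]), parse_entity(parts[2])
        if subj is None or obj is None:
            continue
        triples.append(Triple(subj, parts[1], obj))
    return triples


def build_graph(triples: List[Triple]):
    """Merge identical (label, box) entities into nodes and emit indexed edges."""
    nodes: List[Entity] = []
    index: Dict[Entity, int] = {}
    edges: List[Tuple[int, str, int]] = []
    for t in triples:
        for ent in (t.subject, t.obj):
            if ent not in index:
                index[ent] = len(nodes)
                nodes.append(ent)
        edges.append((index[t.subject], t.predicate, index[t.obj]))
    return nodes, edges


# Example input standing in for VLM output (made up for illustration).
sequence = (
    "person@0.10,0.20,0.50,0.90 | riding | horse@0.30,0.40,0.80,0.95 ; "
    "horse@0.30,0.40,0.80,0.95 | standing on | grass@0.00,0.60,1.00,1.00 ; "
    "broken triple"
)
nodes, edges = build_graph(parse_scene_graph(sequence))
for s, p, o in edges:
    print(nodes[s].label, p, nodes[o].label)
```

Because labels and predicates are kept as plain strings rather than indices into a fixed class list, a parser of this kind does not restrict the output vocabulary. Mapping those strings to benchmark categories for evaluation would be a separate step that is not shown here.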