Abstract: Scene Graph Generation (SGG) converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks. However, existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. To address these limitations, we propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling. To train it, we collect the SGG instruction-tuning dataset, named SpaceSGG. This dataset is constructed by combining publicly available datasets and synthesizing data using open-source models within our data construction pipeline. It combines object locations, object relations, and depth information, resulting in three data formats: spatial SGG description, question-answering, and conversation. To enhance the transfer of MLLMs' inherent capabilities to the SGG task, we introduce a two-stage training paradigm. Experiments show that LLaVA-SpaceSGG outperforms other open-vocabulary SGG methods, boosting recall by 8.6% and mean recall by 28.4% compared to the baseline. Our codebase, dataset, and trained models are publicly accessible on GitHub: https://github.com/Endlinc/LLaVA-SpaceSGG
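
The abstract reports both recall and mean recall. As a rough illustration of how the two differ, the sketch below scores one image's predicted triplets against its ground truth. It is a simplification under stated assumptions, not the paper's evaluation code: triplets are matched by exact label equality (real SGG benchmarks also require the predicted regions to overlap the ground-truth regions), the ground truth is assumed non-empty, and the function name `recall_and_mean_recall` and the toy triplets are hypothetical.

```python
from collections import defaultdict


def recall_and_mean_recall(gt_triplets, pred_triplets, k=20):
    """Score one image.

    gt_triplets:   list of (subject, predicate, object) ground-truth triplets.
    pred_triplets: predicted triplets ranked best-first; only the top k count.
    Assumption: a prediction matches when all three labels are equal.
    """
    top_k = set(pred_triplets[:k])
    hits = defaultdict(int)    # matched ground-truth triplets per predicate
    totals = defaultdict(int)  # all ground-truth triplets per predicate
    for triplet in gt_triplets:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1

    # Recall: fraction of all ground-truth triplets that were recovered.
    recall = sum(hits.values()) / len(gt_triplets)
    # Mean recall: average of per-predicate recalls, so rare predicates
    # weigh as much as frequent ones.
    per_predicate = [hits[p] / totals[p] for p in totals]
    mean_recall = sum(per_predicate) / len(per_predicate)
    return recall, mean_recall


# Toy data (hypothetical): the frequent predicate "on" is always found,
# the rarer spatial predicate "in front of" is missed.
gt = [
    ("person", "on", "grass"),
    ("dog", "on", "grass"),
    ("cup", "on", "table"),
    ("person", "in front of", "tree"),
]
pred = [
    ("person", "on", "grass"),
    ("dog", "on", "grass"),
    ("cup", "on", "table"),
    ("person", "beside", "tree"),
]

recall, mean_recall = recall_and_mean_recall(gt, pred, k=20)
print(recall, mean_recall)  # 0.75 0.5
```

Because mean recall averages over predicate classes, missing a rare spatial predicate lowers it far more than it lowers plain recall, which is one reason the two numbers can move by different amounts.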


### What problems does this paper attempt to solve?

This paper aims to solve two main challenges in Scene Graph Generation (SGG):

1. **Open-vocabulary SGG**:
   - Existing SGG methods usually require direct supervision and are trained with a fixed set of labels, so they generalize poorly to open-set images. In practice, these models have difficulty handling unseen objects or relationships.
2. **Lack of spatial relations**:
   - Existing SGG datasets are mainly based on 2D image annotations. They focus on common relationships between objects and ignore 3D spatial relationships; for example, front-back and up-down relationships between objects are not fully considered.

To solve these two problems, the authors propose **LLaVA-SpaceSGG**, a multimodal large language model (MLLM) designed for open-vocabulary scene graph generation with enhanced modeling of spatial relationships.

### Solutions

To address the above challenges, the authors take the following measures:

1. **Construct the SpaceSGG dataset**:
   - The SpaceSGG dataset combines publicly available datasets with data synthesized by open-source models, and includes object positions, object relationships, and depth information. The data comes in three formats (spatial SGG descriptions, question-answering, and dialogue) to enhance the model's spatial reasoning ability.
2. **Introduce a two-stage training paradigm**:
   - Stage 1: Align the image model (such as CLIP) and the text model so that the model can perform well in open-vocabulary SGG tasks.
   - Stage 2: Refine the model's understanding of region-level spatial relationships, especially in complex visual environments.
3. **Fuse 2D and 3D information**:
   - The dataset contains depth coordinates in addition to planar coordinates, enriching the spatial relationships between objects (such as front-back relationships). Specific steps include using depth-estimation algorithms to generate depth maps, constructing 3D scenes, and extracting 3D SGG from them.
4. **Generate diverse data formats**:
   - The dataset is generated in three different formats: spatial descriptions (SpaceSGG-Desc), single-turn question-answering (SpaceSGG-QA), and multi-turn dialogue (SpaceSGG-Conv), to enhance the model's spatial understanding ability.

### Experimental results

Experiments show that LLaVA-SpaceSGG outperforms existing methods on the PSG validation set, improving recall by 8.6% and mean recall by 28.4% over the baseline. On the newly proposed spatial-relations validation set, the model also achieves higher accuracy.

### Summary

By introducing high-quality spatial and scene-graph data together with a two-stage training paradigm, LLaVA-SpaceSGG improves performance on complex visual tasks, especially in capturing and predicting rich spatial relationships.
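
The summary mentions using depth maps to add front-back relationships on top of 2D annotations. The sketch below shows one simple way such a relation could be derived. It is an illustration under stated assumptions, not the authors' pipeline: objects are given as axis-aligned boxes `(x0, y0, x1, y1)`, each object's depth is taken as the median depth inside its box, smaller depth values are assumed to mean closer to the camera (some monocular estimators output inverse depth, in which case the comparison flips), and the names `object_depth`, `depth_relation`, and the `margin` threshold are hypothetical.

```python
import numpy as np


def object_depth(depth_map, box):
    """Median depth inside a box (x0, y0, x1, y1); robust to background pixels."""
    x0, y0, x1, y1 = box
    return float(np.median(depth_map[y0:y1, x0:x1]))


def depth_relation(depth_map, box_a, box_b, margin=0.1):
    """Relation of object A to object B along the camera axis.

    Assumption: smaller depth means closer to the camera.
    `margin` is a hypothetical tolerance below which depths count as equal.
    """
    depth_a = object_depth(depth_map, box_a)
    depth_b = object_depth(depth_map, box_b)
    if depth_a < depth_b - margin:
        return "in front of"
    if depth_a > depth_b + margin:
        return "behind"
    return "same depth"


# Toy depth map: the left half of the image is near (1.0), the right half far (3.0).
depth = np.empty((4, 8))
depth[:, :4] = 1.0
depth[:, 4:] = 3.0

boxes = {"cup": (0, 0, 4, 4), "lamp": (4, 0, 8, 4)}
relation = depth_relation(depth, boxes["cup"], boxes["lamp"])
print(("cup", relation, "lamp"))  # ('cup', 'in front of', 'lamp')
```

A triplet such as `("cup", "in front of", "lamp")` cannot be read off 2D box coordinates alone, which is the gap the depth information is meant to fill.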

LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations