RelScene: A Benchmark and Baseline for Spatial Relations in Text-Driven 3D Scene Generation

Zhaoda Ye, Xinhan Zheng, Yang Liu, Yuxin Peng
DOI: https://doi.org/10.1145/3664647.3681653
2024-01-01
Abstract: Text-driven 3D indoor scene generation aims to automatically generate and arrange objects to form a 3D scene that accurately captures the semantics of the given text description. Recent works have shown the potential to generate 3D scenes guided by specific object categories and room layouts, but they lack a robust mechanism for keeping spatial relationships consistent with the provided text description during generation. In addition, annotating the objects and relationships in 3D scenes is usually time-consuming and costly, so such annotations are not easily obtained for model training. In this paper, we therefore construct a dataset and benchmark for assessing spatial relations in text-driven 3D scene generation. It contains a comprehensive collection of 3D scenes with textual descriptions, annotated object spatial relations, and both template and free-form natural language descriptions. We also provide a pseudo description feature generation method to handle 3D scenes without language annotations. We design an aligned latent space for spatial relations in 3D scenes and text descriptions, from which features can be sampled according to the spatial relation for few-shot learning. We also propose new metrics that measure how well an approach generates correct spatial relationships among objects.