Abstract: Text-driven 3D indoor scene generation aims to automatically generate and arrange objects into a 3D scene that accurately captures the semantics of a given text description. Existing approaches are mainly guided by specific object categories and room layouts when generating and positioning objects such as furniture within 3D indoor scenes. However, few methods harness the text description to precisely control both spatial relationships and object combinations. Consequently, these methods lack a robust mechanism for determining the accurate object attributes needed to craft a plausible 3D scene whose spatial relationships stay consistent with the provided text description. To tackle these issues, we propose the Memory-Augmented Auto-regressive Network (MAAN), a text-driven method for synthesizing 3D indoor scenes with controllable spatial relationships and object compositions. First, we propose a memory-augmented network that helps the model decide the attributes of each object, such as its 3D coordinates, rotation, and size, which improves the consistency of the spatial relations among objects with the text description. Our approach constructs a memory context to select relevant objects within the scene, which provides spatial information that aids in generating the new object with the correct attributes. Second, we develop a prior attribute prediction network that learns to generate a complete scene with suitable and reasonable object compositions. This network adopts a pre-training strategy to extract composition priors from existing scenes, which enables it to organize multiple objects into a reasonable scene while keeping the object relations consistent with the text description. We conduct experiments on three room types (bedroom, living room, and dining room) from the 3D-FRONT dataset. The results show that our method controls spatial relationships among objects accurately and is more flexible than existing techniques.
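To make the memory-context idea concrete, the following is a minimal sketch, not the MAAN architecture itself. MAAN learns both the memory selection and the attribute prediction; here both are replaced by hand-written rules purely for illustration. The names `SceneObject`, `RELATION_OFFSETS`, `select_memory`, and `place_relative`, the relation vocabulary, and the `gap` parameter are all assumptions introduced for this example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    category: str
    position: Tuple[float, float, float]  # (x, y, z) centre of the bounding box
    size: Tuple[float, float, float]      # (width, height, depth)
    rotation: float                       # yaw angle in radians


# Hypothetical relation vocabulary: unit direction in the x/z floor plane.
RELATION_OFFSETS = {
    "left of": (-1.0, 0.0),
    "right of": (1.0, 0.0),
    "in front of": (0.0, 1.0),
    "behind": (0.0, -1.0),
}


def select_memory(scene: List[SceneObject], anchor_category: str) -> List[SceneObject]:
    """Rule-based stand-in for the memory context: keep only the
    already-placed objects that the text relation refers to."""
    return [obj for obj in scene if obj.category == anchor_category]


def place_relative(scene, category, size, relation, anchor_category, gap=0.1):
    """Append one new object whose position satisfies `relation` with
    respect to the first matching anchor (one auto-regressive step)."""
    memory = select_memory(scene, anchor_category)
    if not memory:
        raise ValueError(f"no '{anchor_category}' in the scene yet")
    anchor = memory[0]
    dx, dz = RELATION_OFFSETS[relation]
    # Move far enough along each axis that the two boxes do not overlap.
    reach_x = (anchor.size[0] + size[0]) / 2 + gap
    reach_z = (anchor.size[2] + size[2]) / 2 + gap
    position = (
        anchor.position[0] + dx * reach_x,
        size[1] / 2,  # rest the object on the floor
        anchor.position[2] + dz * reach_z,
    )
    new_obj = SceneObject(category, position, size, anchor.rotation)
    scene.append(new_obj)
    return new_obj


# Usage: "a nightstand to the left of the bed"
scene = [SceneObject("bed", (0.0, 0.25, 0.0), (2.0, 0.5, 1.6), 0.0)]
nightstand = place_relative(scene, "nightstand", (0.5, 0.5, 0.4), "left of", "bed")
```

Each call adds one object conditioned on the objects already in the scene, which mirrors the auto-regressive generation order described above. A learned model would replace the category match in `select_memory` and the fixed offsets in `place_relative` with predictions conditioned on the text.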

RelScene: A Benchmark and Baseline for Spatial Relations in Text-Driven 3D Scene Generation

A Novel Semantic Annotation Algorithm for Models Based on Associated Scene

Learning 3D Scene Synthesis from Annotated RGB-D Images

Generating Visual Spatial Description via Holistic 3D Scene Understanding

Space3D-Bench: Spatial 3D Question Answering Benchmark

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

3D Scene Graph Generation from Point Clouds

MAAN: Memory-Augmented Auto-regressive Network for Text-driven 3D Indoor Scene Generation

T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation

VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation

FastScene: Text-Driven Fast 3D Indoor Scene Generation via Panoramic Gaussian Splatting

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Weakly-Supervised 3D Spatial Reasoning for Text-based Visual Question Answering

Fast 3D Indoor Scene Synthesis by Learning Spatial Relation Priors of Objects

Holistic Understanding of 3D Scenes as Universal Scene Description

Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning

Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling

A Comprehensive Benchmark and Spatial Data Generation Framework

SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving

Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset