Abstract: Text-driven 3D indoor scene generation aims to automatically generate and arrange objects into a 3D scene that accurately captures the semantics of a given text description. Existing approaches are mainly guided by specific object categories and room layouts when generating and positioning objects such as furniture within 3D indoor scenes. However, few methods harness the text description to precisely control both spatial relationships and object combinations. Consequently, these methods lack a robust mechanism for determining the object attributes needed to craft a plausible 3D scene whose spatial relationships stay consistent with the provided text description. To tackle these issues, we propose the Memory-Augmented Auto-regressive Network (MAAN), a text-driven method for synthesizing 3D indoor scenes with controllable spatial relationships and object compositions. First, we propose a memory-augmented network that helps the model decide object attributes such as 3D coordinates, rotation, and size, which improves the consistency of object spatial relations with the text description. Our approach constructs a memory context to select relevant objects within the scene, which provides spatial information that aids in generating the new object with the correct attributes. Second, we develop a prior attribute prediction network that learns to generate a complete scene with suitable and reasonable object compositions. This network adopts a pre-training strategy to extract composition priors from existing scenes, which enables it to organize multiple objects into a reasonable scene while keeping the object relations consistent with the text description. We conduct experiments on three room types (bedroom, living room, and dining room) from the 3D-FRONT dataset. The results underscore the accuracy of our method in governing spatial relationships among objects and show greater flexibility than existing techniques.

MAAN: Memory-Augmented Auto-regressive Network for Text-driven 3D Indoor Scene Generation
