Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting

Yian Wang, Xiaowen Qiu, Jiageng Liu, Zhehuan Chen, Jiting Cai, Yufei Wang, Tsun-Hsuan Wang, Zhou Xian, Chuang Gan

2024-11-15

Abstract: Creating large-scale interactive 3D environments is essential for the development of Robotics and Embodied AI research. Current methods, including manual design, procedural generation, diffusion-based scene generation, and large language model (LLM) guided scene design, are hindered by limitations such as excessive human effort, reliance on predefined rules or training datasets, and limited 3D spatial reasoning ability. Since pre-trained 2D image generative models better capture scene and object configuration than LLMs, we address these challenges by introducing Architect, a generative framework that creates complex and realistic 3D embodied environments leveraging diffusion-based 2D image inpainting. In detail, we utilize foundation visual perception models to obtain each generated object from the image and leverage pre-trained depth estimation models to lift the generated 2D image to 3D space. Our pipeline is further extended to a hierarchical and iterative inpainting process to continuously generate placement of large furniture and small objects to enrich the scene. This iterative structure brings the flexibility for our method to generate or refine scenes from various starting points, such as text, floor plans, or pre-arranged environments.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper aims to generate large-scale, diverse, interactive 3D environments with realistic details to support research in robotics and embodied AI. Current methods, such as manual design, procedural generation, diffusion-based scene generation, and large language model (LLM) guided scene design, have limitations: they require substantial human effort, rely on predefined rules or training datasets, or have limited 3D spatial reasoning ability. These limitations make it difficult to generate complex, realistic scenes.

To address these challenges, the paper proposes the ARCHITECT framework, which uses diffusion-based 2D inpainting to create complex 3D interactive environments. ARCHITECT proceeds through the following steps:

1. **Initialization Module**: Select a view in the scene, render an image from it, and generate an inpainting mask.
2. **Hierarchical Inpainting Module**: Use a large language model to generate text prompts, then inpaint the image using the rendered image and inpainting mask from the previous step.
3. **Visual Perception Module**: Identify and segment objects, estimate their depths, back-project them into 3D space, and output a 3D bounding box for each object.
4. **Placement Module**: Place the objects into the simulation environment according to their 3D bounding boxes, then return to the initialization stage to continue generating new objects.

Through these steps, ARCHITECT can generate detailed and highly interactive 3D scenes at multiple scales, overcoming the limitations of existing methods. Experimental results show that ARCHITECT outperforms existing methods in generating complex and realistic interactive 3D scenes.
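The back-projection in the Visual Perception Module can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics and a depth map already in metric units; the function names `backproject_mask` and `axis_aligned_bbox` are hypothetical, and the paper's actual handling of depth scale, camera-to-world transforms, and outliers is not described here.

```python
import numpy as np


def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Lift the masked pixels of a depth map to 3D camera-frame points.

    depth: (H, W) array of depth along the camera z-axis (assumed metric).
    mask:  (H, W) boolean array marking one segmented object.
    fx, fy, cx, cy: pinhole intrinsics in pixels (assumed known from the renderer).
    """
    v, u = np.nonzero(mask)      # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx        # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def axis_aligned_bbox(points):
    """Return (min_corner, max_corner) of an (N, 3) point set."""
    return points.min(axis=0), points.max(axis=0)


# Toy example: a 4x4 depth map with a flat object 2 units from the camera.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

points = backproject_mask(depth, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
bbox_min, bbox_max = axis_aligned_bbox(points)
print(bbox_min, bbox_max)
```

The resulting box is expressed in the camera frame. A full pipeline would additionally transform it into world coordinates using the pose of the rendered view before placing an object.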

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

Learning 3D Scene Synthesis from Annotated RGB-D Images

Action-driven 3D Indoor Scene Evolution

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images

RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

SceneCraft: Layout-Guided 3D Scene Generation

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting

PaintScene4D: Consistent 4D Scene Generation from Text Prompts

Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars

DIScene: Object Decoupling and Interaction Modeling for Complex Scene Generation

GRAINS: Generative Recursive Autoencoders for INdoor Scenes

Interactive3D: Create What You Want by Interactive 3D Generation

HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

Simple and Effective Synthesis of Indoor 3D Scenes

iControl3D: An Interactive System for Controllable 3D Scene Generation

ArchComplete: Autoregressive 3D Architectural Design Generation with Hierarchical Diffusion-Based Upsampling