Abstract: We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control. Layout control has been widely studied to alleviate the shortcomings of T2I diffusion models in understanding objects' placement and relationships from text descriptions. Nevertheless, existing approaches for layout control are limited to 2D layouts, require the user to provide a static layout beforehand, and fail to preserve generated images under layout changes. This makes these approaches unsuitable for applications that require 3D object-wise control and iterative refinements, e.g., interior design and complex scene generation. To this end, we leverage the recent advancements in depth-conditioned T2I models and propose a novel approach for interactive 3D layout control. We replace the traditional 2D boxes used in layout control with 3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation process, where at each stage, the user can insert, change, and move an object in 3D while preserving objects from earlier stages. We achieve this through our proposed Dynamic Self-Attention (DSA) module and the consistent 3D object translation strategy. Experiments show that our approach can generate complicated scenes based on 3D layouts, boosting the object generation success rate over the standard depth-conditioned T2I methods by 2x. Moreover, it outperforms the compared methods at preserving objects under layout changes. Project Page: https://abdo-eldesokey.github.io/build-a-scene/

Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image Priors

Learning 3D Scene Synthesis from Annotated RGB-D Images

Toward Scene Graph and Layout Guided Complex 3D Scene Generation

SceneTeller: Language-to-3D Scene Generation

Scene-Conditional 3D Object Stylization and Composition

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Generating 3D People in Scenes Without People

Disentangled 3D Scene Generation with Layout Learning

SceneMotifCoder: Example-driven Visual Program Learning for Generating 3D Object Arrangements

MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text

Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

SceneSeer: 3D Scene Design with Natural Language

GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting

Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop

PaintScene4D: Consistent 4D Scene Generation from Text Prompts

Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars

Learning 3D Object Shape and Layout without 3D Supervision

Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit regularization