Abstract: Text-to-3D generation plays a crucial role in creating editable 3D scenes for AR/VR. Recent advances have shown promise in combining neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation. However, these methods still struggle to parse and regenerate consistent multi-object environments. Specifically, they have difficulty representing the quantity and style of objects described in multi-object prompts, which often leads to a collapse in rendering fidelity and output that fails to match the semantic intricacies of the text. Moreover, assembling these elements into a coherent 3D scene is a substantial challenge, stemming from the generic distribution inherent in diffusion models. To tackle this 'guidance collapse' and further enhance scene consistency, we propose a novel framework, dubbed CompoNeRF, which integrates an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. It begins by interpreting a complex text prompt as a layout populated with multiple NeRFs, each paired with a corresponding subtext prompt for precise object depiction. Next, a tailored composition module blends these NeRFs to promote consistency, while the dual-level text guidance reduces ambiguity and boosts accuracy. Notably, our composition design also permits decomposition, which enables flexible scene editing and recomposition into new scenes based on the edited layout or text prompts. Using the open-source Stable Diffusion model, CompoNeRF generates multi-object scenes with high fidelity. Our framework achieves up to a 54% improvement on the multi-view CLIP score metric. Our user study indicates that our method significantly improves semantic accuracy, multi-view consistency, and individual recognizability for multi-object scene generation.
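
To make the idea of an editable layout of per-object NeRFs more concrete, the following is a minimal sketch, not the authors' implementation. It assumes axis-aligned boxes, stand-in constant fields in place of trained NeRFs, and a density-weighted color blend as the composition rule; the abstract does not specify any of these, and all names (`LayoutEntry`, `query_scene`, `constant_field`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]
# A local field maps a point in the box's own frame to (density, rgb).
# Assumption: this stands in for a trained per-object NeRF.
LocalField = Callable[[Vec3], Tuple[float, Vec3]]


@dataclass
class LayoutEntry:
    prompt: str        # subtext prompt describing this object
    center: Vec3       # box center in scene coordinates
    half_size: Vec3    # box half-extents along each axis
    field: LocalField  # hypothetical per-object field


def to_local(entry: LayoutEntry, p: Vec3) -> Vec3:
    """Map a scene point into the box's normalized frame ([-1, 1] inside)."""
    return tuple((p[i] - entry.center[i]) / entry.half_size[i] for i in range(3))


def query_scene(layout: List[LayoutEntry], p: Vec3) -> Tuple[float, Vec3]:
    """Blend all objects whose box contains p (assumed rule:
    densities add, colors are averaged with density weights)."""
    total = 0.0
    rgb = [0.0, 0.0, 0.0]
    for entry in layout:
        q = to_local(entry, p)
        if any(abs(c) > 1.0 for c in q):
            continue  # point lies outside this object's box
        sigma, color = entry.field(q)
        total += sigma
        for i in range(3):
            rgb[i] += sigma * color[i]
    if total == 0.0:
        return 0.0, (0.0, 0.0, 0.0)  # empty space
    return total, tuple(c / total for c in rgb)


def constant_field(sigma: float, color: Vec3) -> LocalField:
    """Toy field with uniform density and color, for illustration only."""
    return lambda q: (sigma, color)


layout = [
    LayoutEntry("a red apple", (0.0, 0.0, 0.0), (1.0, 1.0, 1.0),
                constant_field(1.0, (1.0, 0.0, 0.0))),
    LayoutEntry("a green plate", (1.0, 0.0, 0.0), (1.0, 1.0, 1.0),
                constant_field(3.0, (0.0, 1.0, 0.0))),
]

# A point inside both boxes gets a blended color.
print(query_scene(layout, (0.5, 0.0, 0.0)))
# Editing the layout (here, dropping the plate) recomposes the scene.
print(query_scene(layout[:1], (0.5, 0.0, 0.0)))
```

Because each object lives in its own box with its own prompt, removing, moving, or swapping a `LayoutEntry` changes the composed scene without touching the other objects, which is the kind of decomposition and recomposition the abstract describes.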

Disentangled 3D Scene Generation with Layout Learning

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

3D-Aware Image Synthesis Via Learning Structural and Textural Representations

Learning 3D Scene Synthesis from Annotated RGB-D Images

SceneCraft: Layout-Guided 3D Scene Generation

Lay-A-Scene: Personalized 3D Object Arrangement Using Text-to-Image Priors

Layout-your-3D: Controllable and Precise 3D Generation with 2D Blueprint

DM-NeRF: 3D Scene Geometry Decomposition and Manipulation from 2D Images

CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout

Neural Rendering in a Room: Amodal 3D Understanding and Free-Viewpoint Rendering for the Closed Scene Composed of Pre-Captured Objects

DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections

Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit regularization

DIScene: Object Decoupling and Interaction Modeling for Complex Scene Generation

Exploring 3D-aware Latent Spaces for Efficiently Learning Numerous Scenes

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes

Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields

LayoutTransformer: Layout Generation and Completion with Self-attention