Abstract: Text-to-3D generation plays a crucial role in creating editable 3D scenes for AR/VR. Recent advances have shown promise in combining neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation. However, these methods still struggle to parse multi-object prompts and generate consistent multi-object environments. Specifically, they have difficulty representing the quantity and style of objects specified by multi-object texts, which often causes a collapse in rendering fidelity so that the output fails to match the semantic intricacies of the prompt. Moreover, assembling these elements into a coherent 3D scene is a substantial challenge, stemming from the generic distribution inherent in diffusion models. To tackle this 'guidance collapse' and further enhance scene consistency, we propose a novel framework, dubbed CompoNeRF, that integrates an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. It first interprets a complex text prompt as a layout populated with multiple NeRFs, each paired with a corresponding subtext prompt for precise object depiction. Next, a tailored composition module seamlessly blends these NeRFs, promoting consistency, while the dual-level text guidance reduces ambiguity and boosts accuracy. Notably, our composition design permits decomposition, which enables flexible scene editing and recomposition into new scenes based on the edited layout or text prompts. Using the open-source Stable Diffusion model, CompoNeRF generates multi-object scenes with high fidelity. Remarkably, our framework achieves up to a 54% improvement on the multi-view CLIP score metric. Our user study indicates that our method significantly improves semantic accuracy, multi-view consistency, and individual recognizability for multi-object scene generation.
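The idea of a layout populated with per-object NeRFs that are blended into one scene can be illustrated with a minimal sketch. It is not CompoNeRF's composition module, which the abstract does not specify. It assumes a generic density-weighted merge followed by standard volume rendering. The names `toy_field`, `compose_ray`, and the dictionary keys `center`, `half_size`, and `field` are hypothetical, and each "NeRF" is replaced by a hard-coded colored sphere.

```python
import numpy as np

def toy_field(color):
    """Hypothetical stand-in for a trained per-object NeRF: a solid sphere of one color."""
    def field(local_pts):
        # Constant density inside the unit sphere of the box-local frame, zero outside.
        r = np.linalg.norm(local_pts, axis=-1)
        sigma = 20.0 * (r < 1.0)
        rgb = np.broadcast_to(np.asarray(color, dtype=float), local_pts.shape)
        return sigma, rgb
    return field

def compose_ray(origin, direction, layout, near=0.0, far=6.0, n_samples=128):
    """Render one ray through a layout of boxes, each owning its own field."""
    t = np.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction            # global sample points, (n, 3)
    sigma_total = np.zeros(n_samples)
    rgb_weighted = np.zeros((n_samples, 3))
    for box in layout:
        # Map global points into the box-local frame, where the box spans [-1, 1]^3.
        local = (pts - box["center"]) / box["half_size"]
        inside = np.all(np.abs(local) <= 1.0, axis=-1)
        sigma, rgb = box["field"](local)
        sigma = sigma * inside                        # an object contributes only inside its box
        sigma_total += sigma
        rgb_weighted += sigma[:, None] * rgb
    # Assumed merge rule: densities add, colors are averaged by density.
    rgb_mix = rgb_weighted / np.maximum(sigma_total, 1e-8)[:, None]
    # Standard volume rendering along the ray.
    delta = np.full(n_samples, t[1] - t[0])
    alpha = 1.0 - np.exp(-sigma_total * delta)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb_mix).sum(axis=0), weights.sum()

# A two-object layout: a red object in front of a blue one along +z.
layout = [
    {"center": np.array([0.0, 0.0, 2.0]), "half_size": 0.5, "field": toy_field([1.0, 0.0, 0.0])},
    {"center": np.array([0.0, 0.0, 4.0]), "half_size": 0.5, "field": toy_field([0.0, 0.0, 1.0])},
]
origin = np.zeros(3)
direction = np.array([0.0, 0.0, 1.0])
rgb_before, opacity_before = compose_ray(origin, direction, layout)

# Editing the layout: move the red object off the ray and re-render without retraining.
layout[0]["center"] = np.array([5.0, 0.0, 2.0])
rgb_after, opacity_after = compose_ray(origin, direction, layout)
```

Because each object lives in its own box-local frame, moving or removing a box changes the rendered scene without touching the other fields, which is the property that makes such a layout editable and decomposable.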

M^2DNeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields

DSEM-NeRF: Multimodal feature fusion and global-local attention for enhanced 3D scene reconstruction

DM-NeRF: 3D Scene Geometry Decomposition and Manipulation from 2D Images

FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis

Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model

ED-NeRF: Efficient Text-Guided Editing of 3D Scene with Latent Space NeRF

Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields

DaRF: Boosting Radiance Fields from Sparse Inputs with Monocular Depth Adaptation

CompoNeRF: Text-guided Multi-object Compositional NeRF with Editable 3D Scene Layout

DiscoNeRF: Class-Agnostic Object Field for 3D Object Discovery

DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

Open-NeRF: Towards Open Vocabulary NeRF Decomposition

DATENeRF: Depth-Aware Text-based Editing of NeRFs

$C^{3}$-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual Neural Radiance Fields

Connecting NeRFs, Images, and Text

MD-NeRF: Enhancing Large-Scale Scene Rendering and Synthesis With Hybrid Point Sampling and Adaptive Scene Decomposition

Multi-tiling Neural Radiance Field (NeRF) -- Geometric Assessment on Large-scale Aerial Datasets

MultiPlaneNeRF: Neural Radiance Field with Non-Trainable Representation

Compressible-composable NeRF via Rank-residual Decomposition